Abstract



Actor-Critic based approaches were among the first to address reinforcement learning in a general 
setting. Recently, these algorithms have gained renewed interest due to their generality, good 
convergence properties, and possible biological relevance. In this paper, we introduce an online 
temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a 
local maximum of the average reward. Linear function approximation is used by the critic in order
to estimate the value function, and the temporal difference signal, which is passed from the critic
to the actor. The main distinguishing feature of the present convergence proof is that both the 
actor and the critic operate on a similar time scale, while in most current convergence proofs they 
are required to have very different time scales in order to converge. Moreover, the same temporal 
difference signal is used to update the parameters of both the actor and the critic. A limitation 
of the proposed approach, compared to results available for two time scale convergence, is that 
convergence is guaranteed only to a neighborhood of an optimal value, rather than to the optimal value
itself. The single time scale and the identical temporal difference signal used by the actor and the
critic may provide a step towards constructing more biologically realistic models of reinforcement
learning in the brain. 

1. Introduction 

In Reinforcement Learning (RL) an agent attempts to improve its performance over time at a 
given task, based on continual interaction with the (usually unknown) environment (Bertsekas and 
Tsitsiklis (1996); Sutton and Barto (1998)). Formally, it is the problem of mapping situations to 
actions in order to maximize a given average reward signal. The interaction between the agent 
and the environment is modeled mathematically as a Markov Decision Process (MDP). Approaches 
based on direct interaction with the environment are referred to as simulation-based algorithms,
and will form the major focus of this paper. 

A well known subclass of RL approaches consists of the so called actor-critic (AC) algorithms 
(e.g., Sutton and Barto (1998)), where the agent is divided into two components, an actor and a 
critic. The critic functions as a value estimator, whereas the actor attempts to select actions based 
on the value estimated by the critic. These two components solve their own problems separately 
but interactively. Many methods for solving the critic's value estimation problem, for a fixed policy,
have been proposed, but, arguably, the most widely used is temporal difference (TD) learning. TD
learning was demonstrated to accelerate convergence by trading bias for variance effectively (Singh
and Dayan (1998)), and is often used as a component of AC algorithms.

In general, policy selection may be randomized. When facing problems with a large number of
states or actions (or even continuous state-action problems), effective policy selection may suffer 
from several problems, such as slow convergence rate or an inefficient representation of the policy. 
A possible approach to policy learning is the so-called policy gradient method (Baxter and Bartlett 
(2001); Cao (2007); Cao and Chen (1997); Konda and Tsitsiklis (2003); Marbach and Tsitsiklis
(1998)). Instead of maintaining a separate estimate for the value of each state (or state-action
pair), the agent maintains a parametrized policy function. The policy function is taken to be a
differentiable function of a parameter vector and of the state. Given the performance measure,
which depends on the agent's policy parameters, these parameters are updated using a sampling-based
estimate of the gradient of the average reward. While such approaches can be proved to converge
under certain conditions (e.g., Baxter and Bartlett (2001)), they often lead to slow convergence, due
to very high variance. A more general approach based on sensitivity analysis, which includes policy
gradient methods as well as non-parametric average reward functions, has been discussed in depth 
in the recent manuscript by Cao (2007). 

Several AC algorithms with associated convergence proofs have been proposed recently (a short 
review is given in section 2.2). As far as we are aware, all the convergence results for these algorithms 
are based on two time scales, specifically, the actor is assumed to update its internal parameters on 
a much slower time scale than the one used by the critic. The intuitive reason for this time scale 
separation is clear, since the actor improves its policy based on the critic's estimates. It can be 
expected that rapid change of the policy parameters may not allow the critic to effectively evaluate 
the value function, which may lead to instability when used by the actor in order to re-update its 
parameters. 

The objective of this paper is to propose an online AC algorithm and establish its convergence 
under conditions which do not require the separation into two time scales. There is clear theoretical 
motivation for such an approach, as it can potentially lead to faster convergence rates, although this
is not an issue we stress in this work. In fact, our motivation for the current direction was based
on the possible relevance of AC algorithms in a biological context (e.g. Daw et al. (2006)), where 
it would be difficult to justify two very different time scales operating within the same anatomical 
structure. We refer the reader to DiCastro et al. (2008) for some preliminary ideas and references 
related to these issues. Given the weaker conditions assumed on the time scales, our convergence 
result is, not surprisingly, somewhat weaker than that provided recently in (e.g., Bhatnagar et al. 
(2008a,b)), as we are not ensured to converge to a local optimum, but only to a neighborhood of such 
an optimum. Nevertheless, it is shown that the neighborhood size can be algorithmically controlled. 
Further comparative discussion can be found in section 2. 

This paper is organized as follows. In section 2 we briefly recapitulate current AC algorithms 
for which convergence proofs are available. In section 3, we formally introduce the problem setup. 
We begin section 4 by relating the TD signal to the gradient of the average reward, and then move 
on to motivate and derive the main AC algorithm, concluding the section with a convergence proof. 
A comparative discussion of the main features of our approach is presented in section 5, followed 
by some simulation results in section 6. Finally, in section 7, we discuss the results and point 
out possible future work. In order to facilitate the readability of the paper, we have relegated all 
technical proofs to appendices. 

2. Previous Work 

In this section we briefly review some previous work in RL which bears direct relevance to our work. 
While many AC algorithms have been introduced over the years, we focus only on those for which 
a convergence proof is available, since the main focus of this work is on convergence issues, rather 
than on establishing the most practically effective algorithms (see, for example, Peters and Schaal 
(2008), for promising applications of AC algorithms in a robotic setting). 

2.1 Direct policy gradient algorithms 

Direct policy gradient algorithms, employing agents which consist of an actor only, typically estimate
a noisy gradient of the average reward, and are relatively close in their characteristics to AC
algorithms. The main difference from the latter is that the agent does not maintain a separate value
estimator for each state, but rather interacts with the environment directly, and in a sense maintains 
its value estimate implicitly through a mapping which signifies which path the agent should take in 
order to maximize its average reward per stage. 

Marbach and Tsitsiklis (1998) suggested an algorithm for non-discounted environments. The 
gradient estimate is based on an estimate of the state values which the actor estimates while
interacting with the environment. If the actor returns to a sequence of previously visited states, it
re-estimates the states' values, not taking into account its previous visits. This approach often results
in large estimation variance. 

Baxter and Bartlett (2001) proposed an online algorithm for partially observable MDPs. In 
this algorithm, the agent estimates the expected average reward for the non-discounted problems 
through an estimate of the value function of a related discounted problem. It was shown that when 
the discount factor approaches 1, the related discounted problem approximates the average reward 
per stage. Similar to the algorithms in (Marbach and Tsitsiklis (1998)), it suffers from relatively 
large estimation variance. In (Baxter et al. (2004)), a method was proposed for coping with the 
large variance by adding a baseline to the value function estimation. 

2.2 Actor Critic Algorithms 

As stated in section 1, the convergence proofs of which we are aware for AC algorithms are based 
on two time scale stochastic approximation (Borkar (1997)), where the actor is assumed to operate 
on a time scale which is much slower than that used by the critic. 

Konda and Borkar (1999) suggested a set of AC algorithms. In two of their algorithms (Algorithms
3 and 6), parametrized policy based actors were used while the critic was based on a lookup
table. Those algorithms and their convergence proofs were specific to the Gibbs policy function in
the actor. 

As far as we are aware, Konda and Tsitsiklis (2003) provided the first convergence proof for an 
AC algorithm based on function approximation. The information passed from the critic to the actor 
is the critic's action-value function, and the critic's basis functions, which are explicitly used by
the actor. They provided a convergence proof of their TD($\lambda$) algorithm where $\lambda$ approaches 1. A
drawback of the algorithm is that the actor and the critic must share the information regarding the 
actor's parameters. This detailed information sharing is a clear handicap in a biological context, 
which was one of the driving forces for the present work. 

Finally, Bhatnagar et al. (2008a, b) recently proposed an AC algorithm which closely resembles 
our proposed algorithm, and which was developed independently of ours. In this work the actor uses 
a parametrized policy function while the critic uses a function approximation for the state evaluation.
The critic passes to the actor the TD(0) signal and based on it the actor estimates the average reward 
gradient. A detailed comparison will be provided in section 5. As pointed out in Bhatnagar et al. 
(2008a, b), their work is the first to provide a convergence proof for an AC algorithm incorporating 
bootstrapping (Sutton and Barto (1998)), where bootstrapping refers to a situation where estimates
are updated based on other estimates, rather than on direct measurements (as in Monte Carlo 
approaches). This feature applies to our work as well. We also note that Bhatnagar et al. (2008a, b) 
extend their approach to the so-called natural gradient estimator, which has been shown to lead 
to improved convergence in supervised learning as well as RL. The present study focuses on the 
standard gradient estimate, leaving the extension to natural gradients to future work. 

3. The Problem Setup 

In this section we describe the formal problem setup, and present a sequence of assumptions and 
lemmas which will be used in order to prove convergence of Algorithm 1 in section 4. These assumptions
and lemmas mainly concern the properties of the controlled Markov chain, which represents
the environment, and the properties of the actor's parametrized policy function. 

3.1 The Dynamics of the Environment and of the Actor 

We consider an agent, composed of an actor and a critic, interacting with an environment. We 
model the environment as a Markov Decision Process (MDP) (Puterman (1994)) in discrete time
with a finite state set $X$ and an action set $U$, which may be uncountable. We denote by $|X|$ the size
of the set $X$. Each selected action $u \in U$ determines a stochastic matrix $P(u) = [P(y|x,u)]_{x,y \in X}$,
where $P(y|x,u)$ is the transition probability from a state $x \in X$ to a state $y \in X$ given the control
$u$. For each state $x \in X$ the agent receives a corresponding reward $r(x)$, which may be deterministic
or random. In the present study we assume for simplicity that the reward is deterministic, a benign
assumption which can be easily generalized.

Assumption 3.1 The rewards, $\{r(x)\}_{x \in X}$, are uniformly bounded by a finite constant $B_r$.

The actor maintains a parametrized policy function. A parametrized policy function is a conditional 
probability function, denoted by $\mu(u|x,\theta)$, which maps an observation $x \in X$ into a control $u \in U$
given a parameter $\theta \in \mathbb{R}^K$. The agent's goal is to adjust the parameter $\theta$ in order to attain maximum
average reward over time. For each $\theta$, we have a Markov Chain (MC) induced by $P(y|x,u)$ and
$\mu(u|x,\theta)$. The state transitions of the MC are obtained by first generating an action $u$ according to
$\mu(u|x,\theta)$, and then generating the next state according to $\{P(y|x,u)\}_{x,y \in X}$. Thus, the MC has a
transition matrix $P(\theta) = [P(y|x,\theta)]_{x,y \in X}$ which is given by

$$P(y|x,\theta) = \int_U P(y|x,u)\, d\mu(u|x,\theta). \tag{1}$$

We denote the space of these transition probabilities by $\mathcal{P} = \{P(\theta) \mid \theta \in \mathbb{R}^K\}$, and its closure by
$\bar{\mathcal{P}}$. The following assumption is needed in the sequel in order to prove the main results (see Bremaud
(1999) for definitions). 

Assumption 3.2 Each MC, $P(\theta) \in \mathcal{P}$, is aperiodic, recurrent, and irreducible.

As a result of Assumption 3.2, we have the following lemma regarding the stationary distribution 
and a common recurrent state. 

Lemma 3.3 Under Assumption 3.2 we have:

1. Each MC, $P(\theta) \in \bar{\mathcal{P}}$, has a unique stationary distribution, denoted by $\pi(\theta)$, satisfying $\pi(\theta)' P(\theta) = \pi(\theta)'$.

2. There exists a state, denoted by $x^*$, which is recurrent for all $P(\theta) \in \bar{\mathcal{P}}$.

Proof For the first part see Corollary 4.1 in (Gallager, 1995). The second part follows trivially 
from Assumption 3.2. ■ 
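
As an illustration of (1) and of Lemma 3.3, the following minimal sketch builds the induced transition matrix and its stationary distribution. It assumes a finite action set, so that the integral in (1) becomes a sum, and a randomly generated MDP with strictly positive transition probabilities. The function names `induced_chain` and `stationary_distribution` and all numerical values are hypothetical and are not taken from the paper.

```python
import numpy as np

def induced_chain(mu, P_env):
    """Transition matrix P(theta) of Eq. (1) for a finite action set.

    mu[x, u]       = mu(u | x, theta)
    P_env[x, u, y] = P(y | x, u)
    """
    return np.einsum("xu,xuy->xy", mu, P_env)

def stationary_distribution(P):
    """Solve pi' P = pi' together with sum(pi) = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical random MDP; strictly positive entries make every induced
# chain irreducible and aperiodic, in the spirit of Assumption 3.2.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
P_env = rng.random((n_states, n_actions, n_states)) + 0.1
P_env /= P_env.sum(axis=2, keepdims=True)

# An arbitrary randomized policy standing in for mu(u | x, theta).
mu = rng.random((n_states, n_actions))
mu /= mu.sum(axis=1, keepdims=True)

P_theta = induced_chain(mu, P_env)
pi = stationary_distribution(P_theta)
```

With these quantities, the average reward per stage of Lemma 3.5 would simply be `pi @ r` for a reward vector `r`.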

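The next technical assumption states that the first and second derivatives of the parametrized policy
function are bounded, and is needed to prove Lemma 3.6 below.

Assumption 3.4 The conditional probability function $\mu(u|x,\theta)$ is twice differentiable. Moreover,
there exist positive constants, $B_{\mu_1}$ and $B_{\mu_2}$, such that for all $x \in X$, $u \in U$, $\theta \in \mathbb{R}^K$, and
$1 \le k_1, k_2 \le K$ we have

$$\left| \frac{\partial \mu(u|x,\theta)}{\partial \theta_{k_1}} \right| \le B_{\mu_1}, \qquad \left| \frac{\partial^2 \mu(u|x,\theta)}{\partial \theta_{k_1} \partial \theta_{k_2}} \right| \le B_{\mu_2}.$$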

A notational comment concerning bounds. Throughout the paper we denote upper bounds on
different variables by the letter $B$, with a subscript corresponding to the variable itself. An additional
numerical subscript, 1 or 2, denotes a bound on the first or second derivative of the variable. For
example, $B_f$, $B_{f_1}$, and $B_{f_2}$ denote the bounds on the function $f$ and its first and second derivatives
respectively.



3.2 Performance Measures 

Next, we define a performance measure for an agent in an environment. The average reward per 
stage of an agent which traverses a MC starting from an initial state $x \in X$ is defined by

$$J(x,\theta) = \lim_{T \to \infty} E\left[ \frac{1}{T} \sum_{n=0}^{T-1} r(x_n) \,\middle|\, x_0 = x, \theta \right],$$

where $E[\cdot|\theta]$ denotes the expectation under the probability measure $P(\theta)$, and $x_n$ is the state at time
$n$. The agent's goal is to find $\theta \in \mathbb{R}^K$ which maximizes $J(x,\theta)$. The following lemma shows that
under Assumption 3.2, the average reward per stage does not depend on the initial state (Bertsekas 
(2006), vol. II, section 4.1). 

Lemma 3.5 Under Assumption 3.2 and based on Lemma 3.3, the average reward per stage, $J(x,\theta)$,
is independent of the starting state, is denoted by $\eta(\theta)$, and satisfies $\eta(\theta) = \pi(\theta)' r$.

Based on Lemma 3.5, the agent's goal is to find a parameter vector $\theta$ which maximizes the average
reward per stage $\eta(\theta)$. In the sequel we show how this maximization can be performed by optimizing
$\eta(\theta)$, using $\nabla_\theta \eta(\theta)$. A consequence of Assumption 3.4 and the definition of $\eta(\theta)$ is the following
lemma. 

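Lemma 3.6

1. For each $x, y \in X$, $1 \le i,j \le K$, and $\theta \in \mathbb{R}^K$, the functions $\partial P(y|x,\theta)/\partial\theta_i$ and $\partial^2 P(y|x,\theta)/\partial\theta_i \partial\theta_j$
are uniformly bounded by $B_{P_1}$ and $B_{P_2}$ respectively.

2. For each $x \in X$, $1 \le i,j \le K$, and $\theta \in \mathbb{R}^K$, the functions $\partial\pi(x|\theta)/\partial\theta_i$ and $\partial^2\pi(x|\theta)/\partial\theta_i \partial\theta_j$
are uniformly bounded by $B_{\pi_1}$ and $B_{\pi_2}$ respectively.

3. For all $1 \le i,j \le K$ and $\theta \in \mathbb{R}^K$, the functions $\eta(\theta)$, $\partial\eta(\theta)/\partial\theta_i$, and $\partial^2\eta(\theta)/\partial\theta_i \partial\theta_j$ are
uniformly bounded by $B_\eta$, $B_{\eta_1}$, and $B_{\eta_2}$ respectively.

4. For all $x \in X$ and $\theta \in \mathbb{R}^K$, there exists a constant $b_\pi > 0$ such that $\pi(x|\theta) \ge b_\pi$.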

The proof is technical and is given in Appendix A.1. For later use, we define the random variable
$T$, which denotes the first return time to the recurrent state $x^*$. Formally,

$$T = \min\{k > 0 \mid x_0 = x^*,\ x_k = x^*\}. \tag{2}$$

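It is easy to show that under Assumption 3.2, the average reward per stage can be expressed by

$$\eta(\theta) = \lim_{T \to \infty} E\left[ \frac{1}{T} \sum_{n=0}^{T-1} r(x_n) \,\middle|\, x_0 = x^*, \theta \right]. \tag{3}$$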



Next, we define the differential value function of state $x \in X$, which represents the average differential
reward the agent receives upon starting from a state $x$ and reaching the recurrent state $x^*$ for the
first time. Mathematically,

$$h(x,\theta) = E\left[ \sum_{n=0}^{T-1} \left( r(x_n) - \eta(\theta) \right) \,\middle|\, x_0 = x, \theta \right]. \tag{4}$$
Abusing notation slightly, we denote $h(\theta) = (h(x_1,\theta), \ldots, h(x_{|X|},\theta)) \in \mathbb{R}^{|X|}$. For each $\theta \in \mathbb{R}^K$ and
$x \in X$, $h(x,\theta)$, $r(x)$, and $\eta(\theta)$ satisfy Poisson's equation (see Theorem 7.4.1 in (Bertsekas (2006))),
i.e.,

$$h(x,\theta) = r(x) - \eta(\theta) + \sum_{y \in X} P(y|x,\theta) h(y,\theta). \tag{5}$$

Based on the differential value we define the temporal difference (TD) between the states $x \in X$ and
$y \in X$ (see Bertsekas and Tsitsiklis (1996), Sutton and Barto (1998)),

$$d(x,y,\theta) = r(x) - \eta(\theta) + h(y,\theta) - h(x,\theta). \tag{6}$$

According to common wisdom, the TD is interpreted as a prediction error. The next lemma states
the boundedness of $h(x,\theta)$ and its derivatives. The proof is given in Appendix A.2.
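
The relation between (4), (5), and (6) can be checked numerically on a small chain. The sketch below is only illustrative: it assumes a randomly generated transition matrix standing in for $P(\theta)$, takes state 0 as the recurrent state $x^*$, and uses the normalization $h(x^*, \theta) = 0$, which is consistent with (4) because a full cycle from $x^*$ back to $x^*$ accumulates zero expected differential reward. The helper names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi' P = pi' with sum(pi) = 1 (least squares on the stacked system)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def differential_value(P, r, x_star=0):
    """Solve Poisson's equation (5) under the normalization h(x*) = 0."""
    n = P.shape[0]
    eta = stationary_distribution(P) @ r
    A = np.vstack([np.eye(n) - P, np.eye(n)[[x_star]]])
    b = np.append(r - eta, 0.0)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return eta, h

# Hypothetical chain and rewards (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n = 5
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

eta, h = differential_value(P, r)

# TD of Eq. (6) for every pair (x, y), and its expectation over the next state.
d = r[:, None] - eta + h[None, :] - h[:, None]
expected_td = (P * d).sum(axis=1)
```

Under the true differential value, the expected TD over the next state vanishes in every state, which is one way to read the interpretation of the TD as a prediction error.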

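Lemma 3.7

1. The differential value function, $h(x,\theta)$, is bounded and has bounded first and second derivatives.
Mathematically, for all $x \in X$, $1 \le i,j \le K$, and for all $\theta \in \mathbb{R}^K$ we have

$$|h(x,\theta)| \le B_h, \qquad \left| \frac{\partial h(x,\theta)}{\partial \theta_i} \right| \le B_{h_1}, \qquad \left| \frac{\partial^2 h(x,\theta)}{\partial \theta_i \partial \theta_j} \right| \le B_{h_2}.$$

2. There exists a constant $B_d$ such that for all $\theta \in \mathbb{R}^K$ we have $|d(x,y,\theta)| \le B_d$, where
$B_d = 2(B_r + B_h)$.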



3.3 The Critic's Dynamics 

The critic maintains an estimate of the environmental state values. It does so by maintaining a 
parametrized function which approximates $h(x,\theta)$, and is denoted by $\tilde{h}(x,w)$. The function $\tilde{h}(x,w)$
is a function of the state $x \in X$ and a parameter $w \in \mathbb{R}^L$. We note that $h(x,\theta)$ is a function of $\theta$, and
is induced by the actor policy $\mu(u|x,\theta)$, while $\tilde{h}(x,w)$ is a function of $w$. Thus, the critic's objective
is to find the parameter $w$ which yields the best approximation of $h(\theta) = (h(x_1,\theta), \ldots, h(x_{|X|},\theta))$, in
a sense to be defined later. We denote this optimal vector by $w^*(\theta)$. An illustration of the interplay
between the actor, critic, and the environment is given in Figure 1. 



4. A Single Time Scale Actor Critic Algorithm with Linear Function 
Approximation 

In this section, we present a version of an AC algorithm, along with its convergence proof. The core 
of the algorithm is based on (7) below, where the actor's estimate of $\nabla_\theta \eta(\theta)$ is based on the critic's
estimate of the TD signal $d(x,y,\theta)$. The algorithm is composed of three iterates, one for the actor
and two for the critic. The actor maintains the iterate of the parameter vector $\theta$ corresponding
to the policy $\mu(u|x,\theta)$, where its objective is to find the optimal value of $\theta$, denoted by $\theta^*$, which
maximizes $\eta(\theta)$. The critic maintains the other two iterates. One iterate is used for estimating
the average reward per stage, $\eta(\theta)$, where its estimate is denoted by $\tilde{\eta}$. The critic's second iterate
maintains a parameter vector, denoted by $w \in \mathbb{R}^L$, which is used for the differential value estimate
using a function approximator, denoted by $\tilde{h}(w)$. For each $\theta \in \mathbb{R}^K$, there exists a $w^*(\theta)$ which,
under the policy induced by $\theta$, is the optimal $w$ for approximating $h(\theta)$ by $\tilde{h}(w)$. The critic's objective is to find
the optimal $\tilde{\eta}$ and $w$.

Figure 1: A schematic illustration of the dynamics between the actor, the critic, and the environment.
The actor chooses an action, $u_n$, according to the parametrized policy $\mu(u|x,\theta)$.
As a result, the environment proceeds to the next state according to the transition
probability $P(x_{n+1}|x_n,u_n)$ and provides a reward. Using the TD signal, the critic improves its
estimation for the environment state values while the actor improves its policy.



4.1 Using the TD Signal to Estimate the Gradient of the Average Reward 

We begin with a theorem which serves as the foundation for the policy gradient algorithm described
in Section 4. The theorem relates the gradient of the average reward per stage, $\eta(\theta)$, to the TD signal.
It was proved in (Bhatnagar et al. (2008a)), and is similar in its structure to other theorems which
connect $\eta(\theta)$ to the Q-value (Konda and Tsitsiklis (2003)), and to the differential value function
(Cao (2007); Marbach and Tsitsiklis (1998)).

We start with a definition of the likelihood ratio derivative

$$\psi(x,u,\theta) \triangleq \frac{\nabla_\theta \mu(u|x,\theta)}{\mu(u|x,\theta)},$$

where the gradient $\nabla_\theta$ is w.r.t. $\theta$, and $\psi(x,u,\theta) \in \mathbb{R}^K$. The following assumption states that
$\psi(x,u,\theta)$ is bounded, and will be used to prove the convergence of Algorithm 1.

Assumption 4.1 For all $x \in X$, $u \in U$, and $\theta \in \mathbb{R}^K$, there exists a positive constant, $B_\psi$, such
that

$$\|\psi(x,u,\theta)\|_2 \le B_\psi < \infty,$$

where $\|\cdot\|_2$ is the Euclidean $L_2$ norm.

Based on this, we present the following theorem which relates the gradient of r]{9) to the TD signal. 
For completeness, we supply a (straightforward) proof in Appendix B. 

Theorem 4.2 For any arbitrary function $f(x)$, the gradient w.r.t. $\theta$ of the average reward per stage
can be expressed by

$$\nabla_\theta \eta(\theta) = \sum_{x,y \in X} P(x,u,y,\theta)\, \psi(x,u,\theta)\, d(x,y,\theta), \tag{7}$$

where $P(x,u,y,\theta)$ is the probability $\Pr(x_n = x, u_n = u, x_{n+1} = y)$ subject to the policy parameter $\theta$.
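
A numerical sanity check of (7) may help to fix ideas. The sketch below makes several assumptions that are not part of the paper: the action set is finite, so that the contribution of $u$ is also summed; the policy is a tabular softmax with one parameter per state-action pair; and the MDP is randomly generated. The stationary joint probability is taken as $P(x,u,y,\theta) = \pi(x|\theta)\mu(u|x,\theta)P(y|x,u)$. The gradient computed from (7) is compared with a central finite difference of $\eta(\theta) = \pi(\theta)'r$. All function names are hypothetical.

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax: mu[x, u] proportional to exp(theta[x, u])."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def chain_quantities(theta, P_env, r):
    """Return mu, pi(theta), eta(theta) and h(theta) for the induced chain."""
    mu = softmax_policy(theta)
    P = np.einsum("xu,xuy->xy", mu, P_env)
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    pi, *_ = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)
    eta = pi @ r
    # Poisson's equation (5) with the normalization h(x_0) = 0.
    B = np.vstack([np.eye(n) - P, np.eye(n)[[0]]])
    h, *_ = np.linalg.lstsq(B, np.append(r - eta, 0.0), rcond=None)
    return mu, pi, eta, h

def td_gradient(theta, P_env, r):
    """Right-hand side of Eq. (7), summed over x, u and y."""
    mu, pi, eta, h = chain_quantities(theta, P_env, r)
    n_actions = theta.shape[1]
    d = r[:, None] - eta + h[None, :] - h[:, None]       # d[x, y], Eq. (6)
    joint = pi[:, None, None] * mu[:, :, None] * P_env   # P(x, u, y, theta)
    # psi[x, u, a] = d log mu(u|x) / d theta[x, a]; zero for other states.
    psi = np.eye(n_actions)[None, :, :] - mu[:, None, :]
    return np.einsum("xuy,xua,xy->xa", joint, psi, d)

def finite_difference_gradient(theta, P_env, r, eps=1e-6):
    """Central finite difference of eta(theta), one coordinate at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        eta_plus = chain_quantities(theta + step, P_env, r)[2]
        eta_minus = chain_quantities(theta - step, P_env, r)[2]
        grad[idx] = (eta_plus - eta_minus) / (2 * eps)
    return grad

# Hypothetical random MDP and policy parameters.
rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
P_env = rng.random((n_states, n_actions, n_states)) + 0.1
P_env /= P_env.sum(axis=2, keepdims=True)
r = rng.random(n_states)
theta = rng.normal(size=(n_states, n_actions))

grad_td = td_gradient(theta, P_env, r)
grad_fd = finite_difference_gradient(theta, P_env, r)
```

In this sketch the two gradients should agree up to the finite-difference error, since adding a constant to $h$ leaves the TD signal (6) unchanged.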
4.2 The updates performed by the critic and the actor

We note that the following derivation regarding the critic is similar in some respects to the derivation 
in section 6.3.3 of Bertsekas and Tsitsiklis (1996) and of Tsitsiklis and Roy (1997). We define 
the following quadratic target function used to evaluate the critic's performance in assessing the 
differential value $h(\theta)$,

$$I(w,\theta) = \frac{1}{2} \sum_{x \in X} \pi(x|\theta) \left( \tilde{h}(x,w) - h(x,\theta) \right)^2. \tag{8}$$

The probabilities $\{\pi(x|\theta)\}_{x \in X}$ are used in order to weight the state
estimates according to the relative number of visits of the agent to the different states.

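Limiting ourselves to the class of linear function approximations in the critic, we consider the
following function for the differential value function

$$\tilde{h}(x,w) = \phi(x)' w, \tag{9}$$

where $\phi(x) \in \mathbb{R}^L$. We define $\Phi \in \mathbb{R}^{|X| \times L}$ to be the matrix

$$\Phi \triangleq \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_L(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_L(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(x_{|X|}) & \phi_2(x_{|X|}) & \cdots & \phi_L(x_{|X|}) \end{pmatrix},$$

where $\phi(\cdot)$ is a column vector. Therefore, we can express (9) in vector form as

$$\tilde{h}(w) = \Phi w, \tag{10}$$

where, abusing notation slightly, we set $\tilde{h}(w) = (\tilde{h}(x_1,w), \ldots, \tilde{h}(x_{|X|},w))'$.

We wish to express (8), and the approximation process, in an appropriate Hilbert space. Define
the matrix $\Pi(\theta)$ to be a diagonal matrix $\Pi(\theta) = \mathrm{diag}(\pi(\theta))$. Thus, (8) can be expressed as

$$I(w,\theta) = \frac{1}{2} \left\| \Pi(\theta)^{1/2} \left( h(\theta) - \Phi w \right) \right\|_2^2 \triangleq \frac{1}{2} \left\| h(\theta) - \Phi w \right\|_{\Pi(\theta)}^2. \tag{11}$$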

In the sequel, we will need the following technical assumption.

Assumption 4.3

1. The columns of the matrix $\Phi$ are independent, i.e., they form a basis of dimension $L$.

2. The norms of the column vectors of the matrix $\Phi$ are bounded above by 1, i.e., $\|\phi_k\|_2 \le 1$
for $1 \le k \le L$.

The parameter $w^*(\theta)$, which optimizes (11), can be directly computed, but involves inverting a
matrix. Thus, in order to find the right estimate for $\tilde{h}(w)$, the following gradient descent (Bertsekas
and Tsitsiklis (1996)) algorithm is suggested,

$$w_{n+1} = w_n - \gamma_n \nabla_w I(w_n, \theta), \tag{12}$$

where $\{\gamma_n\}_{n=1}^{\infty}$ is a positive series satisfying the following assumption, which will be used in proving
the convergence of Algorithm 1.

Assumption 4.4 The positive series $\{\gamma_n\}_{n=1}^{\infty}$ satisfies

$$\sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n^2 < \infty. \tag{13}$$
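
As a concrete illustration (the specific choice of series is ours and is not prescribed by the paper), the harmonic step size $\gamma_n = 1/n$ satisfies (13):

```latex
\sum_{n=1}^{\infty} \frac{1}{n} = \infty,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty .
% More generally, \gamma_n = n^{-a} satisfies (13) for every 1/2 < a \le 1,
% whereas a constant step size violates the second condition.
```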
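Writing the term $\nabla_w I(w_n)$ explicitly yields

$$\nabla_w I(w_n) = \Phi' \Pi(\theta) \Phi w_n - \Phi' \Pi(\theta) h(\theta). \tag{14}$$

For each $\theta \in \mathbb{R}^K$, the value $w^*(\theta)$ is given by setting $\nabla_w I(w,\theta) = 0$, i.e.,

$$w^*(\theta) = \left( \Phi' \Pi(\theta) \Phi \right)^{-1} \Phi' \Pi(\theta) h(\theta). \tag{15}$$

Note that Bertsekas and Tsitsiklis (1996) prove that the matrix $\Phi \left( \Phi' \Pi(\theta) \Phi \right)^{-1} \Phi' \Pi(\theta)$ is a projection
operator into the space spanned by $\Phi w$, with respect to the norm $\|\cdot\|_{\Pi(\theta)}$. Thus, the explicit gradient
descent procedure (12) is

$$w_{n+1} = w_n - \gamma_n \Phi' \Pi(\theta) \left( \Phi w_n - h(\theta) \right). \tag{16}$$

Using the basis $\Phi$ in order to approximate $h(\theta)$ yields an approximation error defined by

$$\epsilon_{app}(\theta) \triangleq \inf_{w \in \mathbb{R}^L} \left\| h(\theta) - \Phi w \right\|_{\Pi(\theta)} = \left\| h(\theta) - \Phi w^*(\theta) \right\|_{\Pi(\theta)}.$$

We can bound this error by

$$\epsilon_{app} \triangleq \sup_{\theta \in \mathbb{R}^K} \epsilon_{app}(\theta). \tag{17}$$

The agent cannot access $h(x,\theta)$ directly. Instead, it can interact with the environment in order
to estimate $h(x,\theta)$. We denote by $\hat{h}_n(x)$ the estimate of $h(x,\theta)$ at time step $n$, thus (16) becomes

$$w_{n+1} = w_n + \gamma_n \Phi' \Pi(\theta) \left( \hat{h}_n - \Phi w_n \right). \tag{18}$$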

This procedure is termed stochastic gradient descent (Bertsekas and Tsitsiklis (1996)). 
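
The relation between the closed form (15) and the gradient iteration (16) can be illustrated on synthetic data. The sketch below is a minimal example under stated assumptions: the basis has orthonormal columns (which satisfies Assumption 4.3), the weights and the target vector are random stand-ins for $\pi(\theta)$ and $h(\theta)$, and a constant step size is used purely for illustration instead of the decreasing series of Assumption 4.4. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, L = 6, 3

# Basis with orthonormal columns: independent, and each column has norm 1.
Phi, _ = np.linalg.qr(rng.normal(size=(n_states, L)))

# Random positive weights standing in for the stationary distribution pi(theta).
pi = rng.random(n_states) + 0.1
pi /= pi.sum()
Pi = np.diag(pi)

# Random vector standing in for the differential value h(theta).
h = rng.normal(size=n_states)

# Closed form of Eq. (15).
w_star = np.linalg.solve(Phi.T @ Pi @ Phi, Phi.T @ Pi @ h)

# Gradient descent of Eq. (16) with a constant step (illustration only).
w = np.zeros(L)
step = 1.0
for _ in range(5000):
    w = w - step * Phi.T @ Pi @ (Phi @ w - h)

# Weighted projection onto the span of the columns of Phi.
Proj = Phi @ np.linalg.solve(Phi.T @ Pi @ Phi, Phi.T @ Pi)
```

Here the iterate `w` approaches `w_star`, and `Proj @ h` coincides with `Phi @ w_star`, the best approximation of `h` in the weighted norm.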

There exist several estimators for $\hat{h}_n$. One sound method, which performs well in practical
problems (see Tesauro (1995)), is the TD($\lambda$) method (see sections 5.3.2 and 6.3.3 in Bertsekas and
Tsitsiklis (1996), or Chapter 6 in Sutton and Barto (1998)), where the parameter $\lambda$ satisfies $0 \le \lambda \le 1$.
This method devises an estimator which is based on previous estimates of $\tilde{h}(w)$, i.e., $w_n$, and is
based also on the environmental reward $r(x_n)$. This idea is a type of bootstrapping algorithm, i.e.,
using existing estimates and new information in order to build more accurate estimates (see Sutton
and Barto (1998), Section 6.1).

The TD($\lambda$) estimator for $\hat{h}_{n+1}$ is

$$\hat{h}_{n+1}(x_n) = (1-\lambda) \sum_{k=0}^{\infty} \lambda^k\, \hat{h}_{n+1}^{(k)}(x_n), \tag{19}$$

where the $k$-step predictor is defined by

$$\hat{h}_{n+1}^{(k)}(x_n) = \left( \sum_{m=0}^{k} r(x_{n+m}) + \hat{h}_n(x_{n+k+1}) \right).$$

The idea of bootstrapping is apparent in (19): the predictor for the differential value of the state
$x_n$ at the $(n+1)$-th time step is based partially on the previous estimates through $\hat{h}_n(x_{n+k+1})$,
and partially on new information, i.e., the reward $r(x_{n+m})$. In addition, the parameter $\lambda$ gives an
exponential weighting for the different $k$-step predictors. Thus, choosing the right $\lambda$ can yield better
estimators.
For the discounted setting, it was proved by Bertsekas and Tsitsiklis (1996) (p. 295) that an
algorithm which implements the TD($\lambda$) estimator (19) online and converges to the right value is the
following one

$$\begin{aligned} w_{n+1} &= w_n + \gamma_n d_n e_n, \\ e_n &= \alpha \lambda e_{n-1} + \phi(x_n), \end{aligned} \tag{20}$$

where $d_n$ is the temporal difference between the $n$-th and the $(n+1)$-th cycle, $e_n$ is the so-called
eligibility trace (see Sections 5.3.3 and 6.3.3 in Bertsekas and Tsitsiklis (1996) or Chapter 7 in Sutton
and Barto (1998)), and the parameter $\alpha$ is the discount factor. The eligibility trace is an auxiliary
variable, which is used in order to implement the idea of (19) as an online algorithm. As the name
implies, the eligibility variable measures how eligible the TD variable, $d_n$, is in (20).

In our setting, the non-discounted case, the analogous equations for the critic are

$$\begin{aligned} w_{n+1} &= w_n + \gamma_n \tilde{d}(x_n, x_{n+1}, w_n)\, e_n, \\ \tilde{d}(x_n, x_{n+1}, w_n) &= r(x_n) - \tilde{\eta}_n + \tilde{h}(x_{n+1}, w_n) - \tilde{h}(x_n, w_n), \\ e_n &= \lambda e_{n-1} + \phi(x_n). \end{aligned} \tag{21}$$
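
The critic recursion (21) can be sketched for a fixed policy, i.e., for a fixed transition matrix. The example below is a minimal illustration and not the paper's implementation: the chain, rewards, and basis are randomly generated; the step size $\gamma_n = 1/n$ is one choice satisfying Assumption 4.4; and the running estimate of the average reward, which (21) uses but does not define, is updated as a plain running average (this corresponds to the update (23) of Algorithm 1 with the step parameter set to 1, an assumption made here for simplicity). The function name `td_lambda_critic` is hypothetical.

```python
import numpy as np

def td_lambda_critic(P, r, Phi, lam=0.5, n_steps=20000, seed=0):
    """Average-reward TD(lambda) critic in the spirit of Eq. (21), fixed chain P."""
    rng = np.random.default_rng(seed)
    n_states, L = Phi.shape
    w = np.zeros(L)        # critic parameters
    e = np.zeros(L)        # eligibility trace
    eta_hat = 0.0          # running estimate of the average reward
    x = 0
    for n in range(1, n_steps + 1):
        gamma = 1.0 / n                               # step size (assumed choice)
        y = rng.choice(n_states, p=P[x])              # next state
        d = r[x] - eta_hat + Phi[y] @ w - Phi[x] @ w  # estimated TD
        e = lam * e + Phi[x]                          # trace update
        w = w + gamma * d * e                         # critic update
        eta_hat = eta_hat + gamma * (r[x] - eta_hat)  # running average of rewards
        x = y
    return eta_hat, w

# Hypothetical chain, rewards in [0, 1), and a random basis with unit-norm columns.
rng = np.random.default_rng(0)
n_states, L = 5, 2
P = rng.random((n_states, n_states)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)
Phi = rng.normal(size=(n_states, L))
Phi /= np.linalg.norm(Phi, axis=0)

# Exact average reward, for comparison with the critic's estimate.
A = np.vstack([P.T - np.eye(n_states), np.ones((1, n_states))])
pi, *_ = np.linalg.lstsq(A, np.append(np.zeros(n_states), 1.0), rcond=None)
eta_true = pi @ r

eta_hat, w = td_lambda_critic(P, r, Phi)
```

With this step size the estimate `eta_hat` is exactly the empirical mean of the observed rewards, so it approaches `eta_true` as the number of steps grows.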

The actor's iterate is motivated by Theorem 4.2. Similarly to the critic, the actor executes
a stochastic gradient ascent step in order to find a local maximum of the average reward per stage $\eta(\theta)$. Therefore,

$$\theta_{n+1} = \theta_n + \gamma_n \psi(x_n, u_n, \theta_n)\, \tilde{d}_n(x_n, x_{n+1}, w_n). \tag{22}$$

A summary of the algorithm is presented in Algorithm 1.

4.3 Convergence Proof for the AC Algorithm

In the remainder of this section, we state the main theorems related to the convergence of Algorithm 1.
We present a sketch of the proof in this section, where the technical details are relegated
to Appendices C and D. The proof is divided into two stages. In the first stage we relate the 
stochastic approximation to a set of ordinary differential equations (ODE). In the second stage, we 
find conditions under which the ODE system converges to a neighborhood of the optimal r]{9). 

The ODE approach is a widely used method in the theory of stochastic approximation for
investigating the asymptotic behavior of stochastic iterates, such as (23)-(25). The key idea of the
technique is that the iterate can be decomposed into a mean function and a noise term, such as a
martingale difference noise. As the iterates advance, the effect of the noise weakens due to repeated
averaging. Moreover, since the step size of the iterate decreases (e.g., $\gamma_n$ in (23)-(25)), one can show
that asymptotically an interpolation of the iterates converges to a continuous solution of the ODE.
Thus, the first part of the convergence proof is to find the ODE system which describes the
asymptotic behavior of Algorithm 1. This ODE will be presented in Theorem 4.6. In the second part we
use ideas from the theory of Lyapunov functions in order to characterize the relation between the
constants, $|X|$, $\Gamma_\eta$, $\Gamma_w$, etc., which ensure convergence to some neighborhood of the maximum point
satisfying $\|\nabla_\theta \eta(\theta)\|_2 = 0$. Theorem 4.7 states conditions on this convergence.

4.3.1 Relate the Algorithm to an ODE 

In order to prove the convergence of this algorithm to the related ODE, we need to introduce the 
following assumption, which adds constraints to the iteration for w, and will be used in the sequel 

Algorithm 1 TD AC Algorithm

Given:

• An MDP with a finite set $X$ of states satisfying Assumption 3.2.

• An actor with a parametrized policy $\mu(u|x,\theta)$ satisfying Assumptions 3.4 and 4.1.

• A critic with a linear basis for $\tilde{h}(w)$ satisfying Assumption 4.3.

• A set $H$, a constant $B_w$, and an operator $\Psi_w$ according to Definition 4.5.

• Step parameters $\Gamma_\eta$ and $\Gamma_w$.

• Choose a TD parameter $0 \le \lambda < 1$.

For step $n = 0$:

• Initiate the critic and the actor variables: $\tilde{\eta}_0 = 0$, $w_0 = 0$, $e_0 = 0$, $\theta_0 = 0$.

For each step $n = 1, 2, \ldots$

Critic: Calculate the estimated TD and eligibility trace

$$\tilde{\eta}_{n+1} = \tilde{\eta}_n + \gamma_n \Gamma_\eta \left( r(x_n) - \tilde{\eta}_n \right) \tag{23}$$
$$\tilde{h}(x, w_n) = w_n' \phi(x),$$
$$\tilde{d}(x_n, x_{n+1}, w_n) = r(x_n) - \tilde{\eta}_n + \tilde{h}(x_{n+1}, w_n) - \tilde{h}(x_n, w_n),$$
$$e_n = \lambda e_{n-1} + \phi(x_n).$$

Set,

$$w_{n+1} = w_n + \gamma_n \Gamma_w \tilde{d}(x_n, x_{n+1}, w_n)\, e_n \tag{24}$$

Actor:

$$\theta_{n+1} = \theta_n + \gamma_n \psi(x_n, u_n, \theta_n)\, \tilde{d}(x_n, x_{n+1}, w_n) \tag{25}$$

Project each component of $w_{n+1}$ onto $H$ (see Definition 4.5)

to prove Theorem 4.6. This assumption may seem restrictive at first but in practice it is not. The
reason is that we usually assume the bounds of the constraints to be large enough so the iterates
practically do not reach those bounds. For example, under Assumption 3.2 and additional mild
assumptions, it is easy to show that $h(\theta)$ is uniformly bounded for all $\theta \in \mathbb{R}^K$. As a result, there
exists a constant bounding $w^*(\theta)$ for all $\theta \in \mathbb{R}^K$. Choosing constraints larger than this constant will
not influence the algorithm performance.

Definition 4.5 Let us denote by $\{w_i\}_{i=1}^{L}$ the components of $w$, and choose a positive constant $B_w$.
We define the set $H \subset \mathbb{R}^K \times \mathbb{R}^L$ to be

$$H = \left\{ (\theta, w) \mid -\infty < \theta_i < \infty,\ 1 \le i \le K,\ -B_w \le w_j \le B_w,\ 1 \le j \le L \right\},$$

and let $\Psi_w$ be an operator which projects $w$ onto $H$, i.e., for each $1 \le j \le L$, $\Psi_w w_j = \max(\min(w_j, B_w), -B_w)$.

The following theorem identifies the ODE system which corresponds to Algorithm 1. The detailed 
proof is given in Appendix C. 

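Theorem 4.6 Define the following functions:

$$G(\theta) = \Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m,$$

$$D^{(x,u,y)}(\theta) = \pi(x) P(u|x,\theta) P(y|x,u) \psi(x,u,\theta), \quad x, y \in X,\ u \in U, \qquad (26)$$

$$A(\theta) = \Phi' \Pi(\theta) \left(M(\theta) - I\right) \Phi,$$

$$M(\theta) = (1 - \lambda) \sum_{m=0}^{\infty} \lambda^m P(\theta)^{m+1},$$

$$b(\theta) = \Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m \left(r - \eta(\theta)\right).$$

Then,

1. Algorithm 1 converges to the invariant set of the following set of ODEs

$$\dot\theta = \nabla_\theta \eta(\theta) + \sum_{x,y \in X,\, u \in U} D^{(x,u,y)}(\theta) \left(\tilde d(x,y,w) - d(x,y,\theta)\right),$$
$$\dot w = \Psi_w\left[\Gamma_w \left(A(\theta) w + b(\theta) + G(\theta)\left(\eta(\theta) - \tilde\eta\right)\right)\right], \qquad (27)$$
$$\dot{\tilde\eta} = \Gamma_\eta \left(\eta(\theta) - \tilde\eta\right),$$

with probability 1.

2. The functions in (26) are continuous with respect to $\theta$.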
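As a small numerical illustration of the quantities in (26), the matrix $M(\theta) = (1-\lambda)\sum_{m} \lambda^m P(\theta)^{m+1}$ is a convex combination of powers of a stochastic matrix and is therefore itself stochastic, and it reduces to $P(\theta)$ when $\lambda = 0$. The sketch below checks this on a hand-picked two-state chain by truncating the series; the helper names `matmul` and `td_lambda_matrix`, the truncation length, and the example matrix are assumptions of this illustration only.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def td_lambda_matrix(P, lam, terms=200):
    """Truncated series M = (1 - lam) * sum_{m=0}^{terms-1} lam^m * P^(m+1)."""
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]   # holds P^(m+1), starting with m = 0
    weight = 1.0 - lam              # holds (1 - lam) * lam^m
    for _ in range(terms):
        for i in range(n):
            for j in range(n):
                M[i][j] += weight * power[i][j]
        power = matmul(power, P)
        weight *= lam
    return M


P = [[0.9, 0.1], [0.4, 0.6]]        # illustrative transition matrix, not from the paper
M = td_lambda_matrix(P, lam=0.5)
row_sums = [sum(row) for row in M]  # each should be very close to 1
```

The truncation error is of order $\lambda^{\text{terms}}$, so for $\lambda$ well below 1 a few hundred terms are more than enough for this kind of sanity check.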

4.3.2 Investigating the ODE Asymptotic Behavior 

Next, we quantify the asymptotic behavior of the system of ODEs in terms of the various algorithmic 
parameters. The proof of the theorem appears in Appendix D. 

Theorem 4.7 Consider the constants $\Gamma_\eta$ and $\Gamma_w$ as defined in Algorithm 1, and the function
approximation bound $\epsilon_{app}$ as defined in (17). Set

$$B_{\nabla\eta} \triangleq \frac{B_{\Delta td1}}{\Gamma_w} + \frac{B_{\Delta td2}}{\Gamma_\eta} + B_{\Delta td3}\, \epsilon_{app},$$

where $B_{\Delta td1}$, $B_{\Delta td2}$, $B_{\Delta td3}$ are finite constants depending on the MDP and agent parameters.
Then, the ODE system (27) satisfies

$$\liminf_{t \to \infty} \|\nabla_\theta \eta(\theta_t)\| \le B_{\nabla\eta}. \qquad (28)$$

Theorem 4.7 has a simple interpretation. Consider the trajectory $\eta(\theta_t)$ for large times, corresponding
to the asymptotic behavior of $\eta_n$. The result implies that the trajectory visits a neighborhood of
a local maximum infinitely often. Although it may leave the local vicinity of the maximum, it is
guaranteed to return to it infinitely often. This occurs since, once it leaves the vicinity, the trajectory
moves in a direction which has a positive projection on the gradient of $\eta$, thereby pushing
the trajectory back to the vicinity of the maximum. It should be noted that in simulation (reported
below) the trajectory usually remains within the vicinity of the local maximum, rarely leaving it.
We also observe that by choosing appropriate values for $\Gamma_\eta$ and $\Gamma_w$ we can control the size of the
ball to which the algorithm converges.

The key idea required to prove the Theorem is the following argument. If the trajectory does
not satisfy $\|\nabla\eta(\theta)\|_2 \le B_{\nabla\eta}$, we have $\dot\eta(\theta) > \epsilon$ for some positive $\epsilon$. As a result, we have a monotone
function which increases to infinity, thereby contradicting the boundedness of $\eta(\theta)$. Thus, $\eta(\theta)$ must
visit the set which satisfies $\|\nabla\eta(\theta)\|_2 \le B_{\nabla\eta}$ infinitely often.



5. A Comparison to other convergence results 

In this section, we point out the main differences between Algorithm 1, the first algorithm proposed 
by Bhatnagar et al. (2008b) and the algorithms proposed by Konda and Tsitsiklis (2003). The main 
dimensions along which we compare the algorithms are the time scale, the type of the TD signal, 
and whether the algorithm is on line or off line. 

The Time Scale and Type of Convergence 

As was mentioned previously, the algorithms of Bhatnagar et al. (2008b) and Konda and Tsitsiklis
(2003) need to operate in two time scales. More precisely, this refers to the following situation.
Denoting the time step of the critic's iteration by $\gamma_n^c$ and the time step of the actor's iteration by $\gamma_n^a$,
we have $\gamma_n^a = o(\gamma_n^c)$, i.e.,

$$\lim_{n \to \infty} \frac{\gamma_n^a}{\gamma_n^c} = 0.$$

The use of two time scales stems from the need of the critic to give an accurate estimate of the 
state values (as in the work of Bhatnagar et al. (2008b)) or the state-action values (as in the work 
of Konda and Tsitsiklis (2003)) before the actor uses them. 
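The two time scale condition can be made concrete with the step sizes that Section 6 quotes for the GARNET(30, 4, 2, 0.1) problem. The sketch below is only an illustration: the function names are hypothetical, and the constants `Gamma_eta` and `Gamma_w` in the single time scale variant are arbitrary example values. It shows the actor-to-critic ratio shrinking with $n$ in the two time scale case, whereas scaling one common step sequence by constants keeps every ratio fixed.

```python
def two_scale_steps(n):
    """Critic and actor steps as quoted in Section 6 for GARNET(30, 4, 2, 0.1)."""
    critic = 100.0 / (1000.0 + n ** (2.0 / 3.0))
    actor = 1000.0 / (100000.0 + n)
    return critic, actor


def actor_to_critic_ratio(n):
    """Ratio gamma^a_n / gamma^c_n, which tends to 0 for two time scale schemes."""
    critic, actor = two_scale_steps(n)
    return actor / critic


def single_scale_steps(n, Gamma_eta, Gamma_w):
    """One common step gamma_n; the critic iterates only rescale it by constants."""
    gamma = 100.0 / (1000.0 + n ** (2.0 / 3.0))
    return gamma, Gamma_eta * gamma, Gamma_w * gamma  # actor, critic (eta), critic (w)


ratios = [actor_to_critic_ratio(n) for n in (1e3, 1e6, 1e9)]
```

With these schedules the ratio decays roughly like $n^{-1/3}$, which is what forces the actor to move slowly relative to the critic.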

In the algorithm proposed here, a single time scale is used for the three iterates of Algorithm 1.
We have $\gamma_n^a = \gamma_n$ for the actor iterate, $\gamma_n^{c,\eta} = \Gamma_\eta \gamma_n$ for the critic's $\tilde\eta_n$ iterate, and $\gamma_n^{c,w} = \Gamma_w \gamma_n$ for
the critic's $w$ iterate. Thus,

$$\lim_{n \to \infty} \frac{\gamma_n^{c,\eta}}{\gamma_n^a} = \Gamma_\eta, \qquad \lim_{n \to \infty} \frac{\gamma_n^{c,w}}{\gamma_n^a} = \Gamma_w.$$

Due to the single time scale, Algorithm 1 has the potential to converge faster than algorithms
based on two time scales, since both the actor and the critic may operate on the fast time scale. The
drawback of Algorithm 1 is that convergence to the optimal value cannot be guaranteed, in contrast
to the algorithms of Bhatnagar et al. (2008b) and Konda and Tsitsiklis (2003), for which such
convergence was proved. Instead, convergence to a neighborhood in $\mathbb{R}^K$ around the optimal value is
guaranteed. In order to make the neighborhood smaller, we need to choose $\Gamma_\eta$ and $\Gamma_w$ appropriately,
as is stated in Theorem 4.7.

The TD Signal, the Information Passed Between the Actor and the Critic, and the
Critic's Basis

The algorithm presented in Bhatnagar et al. (2008b) is essentially a TD(0) algorithm, while the
algorithm in Konda and Tsitsiklis (2003) is TD(1). Our algorithm is a TD($\lambda$) algorithm for $0 \le \lambda < 1$.
A major difference between the approaches in Bhatnagar et al. (2008b) and the present work, as
compared to Konda and Tsitsiklis (2003), is the information passed from the critic to the actor. In the
former cases, the information passed is the TD signal, while in the latter case the Q-value is passed.
Additionally, in Bhatnagar et al. (2008b) and in Algorithm 1 the critic's basis functions do not
change through the simulation, while in Konda and Tsitsiklis (2003) the critic's basis functions are
changed in each iteration according to the actor's parameter $\theta$. Finally, we comment that Bhatnagar
et al. (2008b) introduced an additional algorithm, based on the so-called natural gradient, which led
to improved convergence speed. In this work we limit ourselves to algorithms based on the regular
gradient, and defer the incorporation of the natural gradient to future work. As stated in Section
1, our motivation in this work was the derivation of a single time scale online AC algorithm with
guaranteed convergence, which may be applicable in a biological context. The more complex natural
gradient approach seems more restrictive in this setting.

6. Simulations 

We report empirical results applying Algorithm 1 to a set of abstract randomly constructed MDPs
which are termed Average Reward Non-stationary Environment Test-bench, or in short GARNET
(Archibald et al. (1995)). GARNET problems comprise a class of randomly constructed finite MDPs
serving as a test-bench for control and RL algorithms optimizing the average reward per stage. A
GARNET problem is characterized in our case by four parameters and is denoted by GARNET($X$, $U$, $B$, $\sigma$).
The parameter $X$ is the number of states in the MDP, $U$ is the number of actions, $B$ is the branching
factor of the MDP, i.e., the number of non-zero entries in each row of the MDP's transition matrices,
and $\sigma$ controls the variance of each transition reward.

We describe how a GARNET problem is generated. When constructing such a problem, we
generate for each state a reward, distributed normally with zero mean and unit variance. For each
state-action pair the reward is distributed normally with the state's reward as mean and variance $\sigma^2$.
The transition matrix for each action is composed of $B$ non-zero terms in each row which sum to
one.
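The construction just described can be sketched as follows. This is a hypothetical generator written only for illustration: the function name `make_garnet`, the seeding, and in particular the way the $B$ non-zero transition probabilities are drawn (uniform cut points of the unit interval) are assumptions, since the text only requires $B$ non-zero terms per row that sum to one. The argument `sigma` is treated as a standard deviation, so that the reward variance is `sigma ** 2`.

```python
import random


def make_garnet(num_states, num_actions, branching, sigma, seed=0):
    """Return (P, reward) for a randomly generated GARNET-style MDP.

    P[u][x][y] is the probability of moving from x to y under action u;
    reward[x][u] is the reward attached to the state-action pair (x, u).
    """
    rng = random.Random(seed)
    # One base reward per state, drawn from N(0, 1)
    state_reward = [rng.gauss(0.0, 1.0) for _ in range(num_states)]
    # State-action rewards drawn from N(state_reward[x], sigma^2)
    reward = [[rng.gauss(state_reward[x], sigma) for _ in range(num_actions)]
              for x in range(num_states)]

    P = []
    for _ in range(num_actions):
        rows = []
        for _ in range(num_states):
            targets = rng.sample(range(num_states), branching)
            # Assumed scheme: split [0, 1] at (branching - 1) uniform cut points
            cuts = sorted(rng.random() for _ in range(branching - 1))
            probs = [hi - lo for lo, hi in zip([0.0] + cuts, cuts + [1.0])]
            row = [0.0] * num_states
            for y, p in zip(targets, probs):
                row[y] = p
            rows.append(row)
        P.append(rows)
    return P, reward


P, reward = make_garnet(num_states=30, num_actions=4, branching=2, sigma=0.1)
```

The call at the end mirrors the shape of the GARNET(30, 4, 2, 0.1) problem used in the simulations, although the random instance it produces is of course not the one used in the paper.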

We note that a comparison was carried out by Bhatnagar et al. (2008b) between their algorithm 
and the algorithm of Konda and Tsitsiklis (2003). We therefore compare our results directly to the 
more closely related former approach (see also Section 5). 

We consider the same GARNET problems as those simulated by Bhatnagar et al. (2008b). For
completeness, we provide here the details of the simulation. For the critic's feature vector, we use a
linear function approximation $\tilde h(x, w) = \phi(x)'w$, where $\phi(x) \in \{0, 1\}^L$, and define $l$ to be the number
of nonzero values in $\phi(x)$. The nonzero values are chosen uniformly at random, where any two states
have different feature vectors. The actor's feature vectors are of size $L \times |U|$, and are constructed as

$$\xi(x,u) \triangleq \Big(\underbrace{0, \ldots, 0}_{L \times (u-1)},\ \phi(x),\ \underbrace{0, \ldots, 0}_{L \times (|U|-u)}\Big).$$

Bhatnagar et al. (2008b) reported simulation results for two GARNET problems: GARNET(30, 4, 2, 0.1)
and GARNET(100, 10, 3, 0.1). For the GARNET(30, 4, 2, 0.1) problem, Bhatnagar et al. (2008b) used
critic steps $\gamma_n^{c,w}$ and $\gamma_n^{c,\eta}$, and actor steps $\gamma_n^a$, where

$$\gamma_n^{c,w} = \frac{100}{1000 + n^{2/3}}, \qquad \gamma_n^{c,\eta} = 0.95\,\gamma_n^{c,w}, \qquad \gamma_n^{a} = \frac{1000}{100000 + n},$$




A Convergent Online Single Time Scale Actor Critic Algorithm 




Figure 2: Simulation results applying Algorithm 1 (red solid line) and algorithm 1 of Bhatnagar
et al. (2008b) (blue dashed line) on a GARNET(30, 4, 2, 0.1) problem (a) and on a
GARNET(100, 10, 3, 0.1) problem (b). Standard errors of the mean (suppressed for visibility)
are of the order of 0.04.



and for GARNET(100, 10, 3, 0.1) the steps were



10^ 10^ 

^C'-^ — - ^ Q n^^c,w a,ri 

In 1 nfi I 9/.^!' "■'^"/n ' In 



106+^2/3' 'n ■ /« : /„ 108 + n' 

In our simulations we used a single time scale, $\gamma_n$, which was equal to $\gamma_n^{c,w}$ as used by Bhatnagar
et al. (2008b). The basis parameters for GARNET(30, 4, 2, 0.1) were $L = 8$ and $l = 3$, whereas for
GARNET(100, 10, 3, 0.1) they were $L = 20$ and $l = 5$.
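The feature construction used in the simulations can be sketched as below. This is an illustrative reading of the description only: the function names are hypothetical, actions are indexed from 1 as in the block layout of the actor's feature vector, and distinctness of the critic features is enforced by simple rejection sampling, which is an assumption about a detail the text leaves open.

```python
import math
import random


def critic_features(num_states, L, l, seed=0):
    """Binary feature vectors of length L with exactly l ones, distinct across states."""
    if num_states > math.comb(L, l):
        raise ValueError("not enough distinct binary vectors with l ones")
    rng = random.Random(seed)
    seen, feats = set(), []
    while len(feats) < num_states:
        ones = tuple(sorted(rng.sample(range(L), l)))
        if ones in seen:
            continue  # resample so that any two states get different feature vectors
        seen.add(ones)
        feats.append([1 if i in ones else 0 for i in range(L)])
    return feats


def actor_feature(phi_x, u, num_actions):
    """Place phi(x) in the u-th block of length L (u is 1-based); zeros elsewhere."""
    L = len(phi_x)
    return [0] * (L * (u - 1)) + list(phi_x) + [0] * (L * (num_actions - u))


feats = critic_features(num_states=30, L=8, l=3)      # sizes of the GARNET(30, 4, 2, 0.1) setup
xi = actor_feature(feats[0], u=2, num_actions=4)      # vector of length L * |U| = 32
```

With $L = 8$ and $l = 3$ there are 56 possible binary vectors, so 30 distinct ones can always be found.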

In Figure 2 we show results of applying Algorithm 1 (solid line) and algorithm 1 from Bhatnagar
et al. (2008b) (dashed line) on GARNET(30, 4, 2, 0.1) and GARNET(100, 10, 3, 0.1) problems. Each
graph in Figure 2 represents an average of 100 independent simulations. Note that an agent with
a uniform action selection policy will attain an average reward per stage of zero in these problems.
Figure 3 presents similar results for GARNET(30, 15, 15, 0.1). We see from these results that in all
simulations, during the initial phase, Algorithm 1 converges faster than algorithm 1 from Bhatnagar
et al. (2008b). The long term behavior is problem-dependent, as can be seen by comparing Figures 2
and 3; specifically, in Figure 2 the present algorithm converges to a higher value than Bhatnagar et al.
(2008b), while the situation is reversed in Figure 3. We refer the reader to Mokkadem and Pelletier
(2006) for a careful discussion of convergence rates for two time scales algorithms; a corresponding
analysis of convergence rates for single time scale algorithms is currently an open problem.

The results displayed here suggest a possible avenue for combining both algorithms. More con- 
cretely, using the present approach may lead to faster initial convergence due to the single time scale 
setting, which allows both the actor and the critic to evolve rapidly, while switching smoothly to a 
two time scales approach as in (Bhatnagar et al. (2008b)) will lead to asymptotic convergence to a 
point rather than to a region. This type of approach is reminiscent of the quasi-Newton algorithms 
in optimization, and is left for future work. As discussed in Section 5, we do not consider the natural 
gradient based algorithms from Bhatnagar et al. (2008b) in this comparative study. 



7. Discussion and Future Work 

We have introduced an algorithm where the information passed from the critic to the actor is the
temporal difference signal, while the critic applies a TD($\lambda$) procedure. A policy gradient approach
was used in order to update the actor's parameters, based on a critic using linear function
approximation. The main contribution of this work is a convergence proof in a situation where both
the actor and the critic operate on the same time scale. The drawback of the extra flexibility in
time scales is that convergence is only guaranteed to a neighborhood of a local maximum value of
the average reward per stage. However, this neighborhood depends on parameters which may be
controlled to improve convergence.

Figure 3: Simulation results applying Algorithm 1 (red solid line) and algorithm 1 of Bhatnagar
et al. (2008b) (blue dashed line) on a GARNET(30, 15, 15, 0.1) problem. Standard errors of
the mean (suppressed for visibility) are of the order of 0.018.

This work sets the stage for much future work. First, as observed above, the size of the
convergence neighborhood is inversely proportional to the step sizes $\Gamma_w$ and $\Gamma_\eta$. In other words, in order
to reduce this neighborhood we need to select larger values of $\Gamma_w$ and $\Gamma_\eta$. This, on the other hand,
increases the variance of the algorithm. Therefore, further investigation of methods which reduce
this variance is needed. However, the bounds used throughout are clearly rather loose, and cannot
be effectively used in practical applications. Obviously, improving the bounds, and conducting
careful numerical simulations in order to obtain a better practical understanding of the influence of
the different algorithmic parameters, is called for. In addition, there is clearly room for combining
the advantages of our approach with those of AC algorithms for which convergence to a single point
is guaranteed, as discussed in Section 6.

From a biological point of view, our initial motivation to investigate TD based AC algorithms 
stemmed from questions related to the implementation of RL in the mammalian brain. Such a view 
is based on an interpretation of the transient activity of the neuromodulator dopamine as a TD 
signal (e.g., Schultz (2002)). Recent evidence suggested that the dorsal and ventral striatum may 
implement the actor and the critic, respectively (e.g., Daw et al. (2006)). We believe that theoretical
models such as (Bhatnagar et al. (2008b)) and Algorithm 1 may provide, even if partially, a firm 
foundation to theories at the neural level. Some initial attempts in a neural setting (using direct 
policy gradient rather than AC based approaches) have been made by Baras and Meir (2007) and 
Florian (2007). Such an approach may lead to functional insights as to how an AC paradigm may 
be implemented at the cellular level of the basal ganglia and cortex. An initial demonstration was 
given by DiCastro et al. (2008). 

From a theoretical perspective many issues remain open. First, strengthening Theorem 4.7
by replacing $\liminf$ by $\lim$ would clearly be useful. Second, extending the recent convergence rate
results in Mokkadem and Pelletier (2006) to the single time scale case is an important challenging
problem. Third, systematically combining the advantages of single time scale convergence (fast initial
dynamics) and two time scales approaches (convergence to a point) would clearly be beneficial.

APPENDIX



Appendix A. Proofs of Results from Section 3

A.1 Proof of Lemma 3.6

1. Looking at (1) we see that $P(y|x, \theta)$ is a compound function of an integral and a twice
differentiable function, $\mu(u|x,\theta)$, with bounded first and second derivatives according to Assumption
3.4. Therefore, $P(y|x, \theta)$ is a twice differentiable function with bounded first and second
derivatives for all $\theta \in \mathbb{R}^K$.



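2. According to Lemma 3.3, for each $\theta \in \mathbb{R}^K$ we have a unique solution to the following
non-homogeneous linear equation system in $\{\pi(i|\theta)\}_{i=1}^{|X|}$,

$$\sum_{i=1}^{|X|} \pi(i|\theta) P(j|i,\theta) = \pi(j|\theta), \quad j = 1, \ldots, |X| - 1, \qquad (29)$$
$$\sum_{i=1}^{|X|} \pi(i|\theta) = 1,$$

or in matrix form $M(\theta)\pi(\theta) = b$. By Assumption 3.2, the equation system (29) is invertible,
therefore, $\det[M(\theta)] > 0$. This holds for all $P(\theta) \in \bar{\mathcal{P}}$, thus, there exists a positive constant,
$b_M$, which uniformly lower bounds $\det[M(\theta)]$ for all $\theta \in \mathbb{R}^K$. Thus, using Cramer's rule we have

$$\pi(i|\theta) = \frac{Q(i,\theta)}{\det[M(\theta)]},$$

where $Q(i, \theta)$ is a finite polynomial of $\{P(j|i, \theta)\}_{i,j \in X}$ of at most degree $|X|$ and with at most
$|X|!$ terms. Writing $\partial\pi(x|\theta)/\partial\theta_i$ explicitly, with $Q(x,\theta)$ denoting the numerator for state $x$, gives

$$\left|\frac{\partial \pi(x|\theta)}{\partial \theta_i}\right| = \left|\frac{\det[M(\theta)]\,\frac{\partial}{\partial\theta_i} Q(x,\theta) - Q(x,\theta)\,\frac{\partial}{\partial\theta_i}\det[M(\theta)]}{\det[M(\theta)]^2}\right|$$
$$\le \left|\frac{\frac{\partial}{\partial\theta_i} Q(x,\theta)}{\det[M(\theta)]}\right| + \left|\frac{Q(x,\theta)\,\frac{\partial}{\partial\theta_i}\det[M(\theta)]}{\det[M(\theta)]^2}\right| < \infty,$$

where finiteness follows since $\det[M(\theta)] \ge b_M > 0$ and since $Q(x,\theta)$ and $\det[M(\theta)]$ are polynomials
with at most $|X|!$ terms of degree at most $|X|$ in entries that are bounded and have bounded derivatives.
This gives the desired bound. Following similar steps we can show the boundedness of the
second derivatives.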

3. The average reward per stage, $\eta(\theta)$, is a linear combination of $\{\pi(i|\theta)\}_{i=1}^{|X|}$, with bounded
coefficients by Assumption 3.1. Therefore, using part 2, $\eta(\theta)$ is twice differentiable with
bounded first and second derivatives for all $\theta \in \mathbb{R}^K$.

4. Since $\pi(x|\theta)$ is the stationary distribution of a recurrent MC, according to Assumption 3.2
there is a positive probability to be in each state $x \in X$. This applies to the closure of $\mathcal{P}$.
Thus, there exists a positive constant $b_\pi$ such that $\pi(x|\theta) \ge b_\pi$.

A.2 Proof of Lemma 3.7

1. We recall the Poisson equation (5). We have the following system of linear equations in
$\{h(x|\theta)\}_{x \in X}$, namely,

$$h(x|\theta) = r(x) - \eta(\theta) + \sum_{y \in X} P(y|x, \theta) h(y|\theta), \quad \forall x \in X,\ x \ne x^*, \qquad (30)$$
$$h(x^*|\theta) = 0,$$

or in matrix form $N(\theta)h(\theta) = c$. Adding the equation $h(x^*|\theta) = 0$ yields a unique solution
for the system (Bertsekas (2006), Vol. 1, Prop. 7.4.1). Thus, using Cramer's rule we have
$h(x|\theta) = R(x, \theta)/\det[N(\theta)]$, where $R(x, \theta)$ and $\det[N(\theta)]$ are polynomial functions of entries in
$N(\theta)$, which are bounded and have bounded first and second derivatives according to Lemma
3.6. Continuing with the same steps as in the proof of Lemma 3.6, we conclude that $h(x|\theta)$ and its
first two derivatives are bounded for all $x \in X$ and for all $\theta \in \mathbb{R}^K$.

2. Trivially, by (6) and the previous part the result follows.
Appendix B. Proof of Theorem 4.2 

We begin with a Lemma which was proved in (Marbach and Tsitsiklis (1998)). It relates the gradient 
of the average reward per stage to the differential value function. 

Lemma B.1 The gradient of the average reward per stage can be expressed by

$$\nabla_\theta \eta(\theta) = \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta)\, h(y,\theta). \qquad (31)$$

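For completeness, we present a proof, which will be used in the sequel.
Proof We begin with Poisson's equation (5) in vector form

$$h(\theta) = r - e\,\eta(\theta) + P(\theta) h(\theta),$$

where $e$ is a column vector of 1's. Taking the derivative with respect to $\theta$ and rearranging yields

$$e\,\nabla_\theta \eta(\theta) = -\nabla_\theta h(\theta) + \nabla_\theta P(\theta)\, h(\theta) + P(\theta) \nabla_\theta h(\theta).$$

Multiplying the last equation from the left by the stationary distribution $\pi(\theta)'$ yields

$$\nabla_\theta \eta(\theta) = -\pi(\theta)' \nabla_\theta h(\theta) + \pi(\theta)' \nabla_\theta P(\theta)\, h(\theta) + \pi(\theta)' P(\theta) \nabla_\theta h(\theta)$$
$$= -\pi(\theta)' \nabla_\theta h(\theta) + \pi(\theta)' \nabla_\theta P(\theta)\, h(\theta) + \pi(\theta)' \nabla_\theta h(\theta)$$
$$= \pi(\theta)' \nabla_\theta P(\theta)\, h(\theta),$$

where we used $\pi(\theta)' e = 1$ and $\pi(\theta)' P(\theta) = \pi(\theta)'$. Expressing the result explicitly we obtain

$$\nabla_\theta \eta(\theta) = \sum_{x,y \in X} P(x)\, \nabla_\theta P(y|x,\theta)\, h(y,\theta)$$
$$= \sum_{x,y \in X} P(x)\, \nabla_\theta \Big(\sum_{u} P(y|x,u)\, \mu(u|x,\theta)\Big) h(y,\theta)$$
$$= \sum_{x,y \in X} \sum_{u} P(x)\, P(y|x,u)\, \nabla_\theta \mu(u|x,\theta)\, h(y,\theta)$$
$$= \sum_{x,y \in X,\, u \in U} P(y|x,u)\, P(x)\, \nabla_\theta \mu(u|x,\theta)\, h(y,\theta)$$
$$= \sum_{x,y \in X,\, u \in U} P(y|x,u)\, \mu(u|x,\theta)\, P(x)\, \frac{\nabla_\theta \mu(u|x,\theta)}{\mu(u|x,\theta)}\, h(y,\theta)$$
$$= \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta)\, h(y,\theta). \qquad (32)$$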



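Based on this, we can now prove Theorem 4.2. We start with the result in (32).

$$\nabla_\theta \eta(\theta) = \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta)\, h(y,\theta)$$
$$= \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta) \left(h(y,\theta) - h(x,\theta) + r(x) - \eta(\theta) + f(x)\right)$$
$$\quad - \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta) \left(-h(x,\theta) + r(x) - \eta(\theta) + f(x)\right)$$
$$= \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta) \left(d(x,y,\theta) + f(x)\right)$$
$$\quad - \sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta) \left(-h(x,\theta) + r(x) - \eta(\theta) + f(x)\right).$$

In order to complete the proof, we show that the second term equals 0. We define
$F(x, \theta) \triangleq -h(x|\theta) + r(x) - \eta(\theta) + f(x)$ and obtain

$$\sum_{x,y \in X,\, u \in U} P(x,u,y,\theta)\, \psi(x,u,\theta)\, F(x,\theta) = \sum_{x \in X} \pi(x|\theta)\, F(x,\theta) \sum_{u \in U,\, y \in X} P(y|x,u)\, \nabla_\theta \mu(u|x,\theta) = 0,$$

where the last equality holds since $\sum_{u \in U,\, y \in X} P(y|x,u)\, \mu(u|x,\theta) = 1$ for every $\theta$, so its gradient vanishes.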

Appendix C. Proof of Theorem 4.6 

As mentioned earlier, we use Theorem 6.1.1 of Kushner and Yin (1997). We start by describing the
setup of the theorem and the main result. Then, we show that the required assumptions hold in our 
case. 

C.1 Setup, Assumptions and Theorem 6.1.1 of Kushner and Yin (1997)

In this section we describe briefly but accurately the conditions for Theorem 6.1.1 of Kushner and
Yin (1997) and state the main result. We consider the following stochastic iteration

$$y_{n+1} = \Pi_H[y_n + \gamma_n Y_n], \qquad (33)$$

where $Y_n$ is a vector of "observations" at time $n$, and $\Pi_H$ is a constraint operator as defined in
Definition 4.5. Recall that $\{x_n\}$ is a Markov chain. Based on this, define $\mathcal{F}_n$ to be the $\sigma$-algebra

$$\mathcal{F}_n \triangleq \sigma\{y_0, Y_{i-1}, x_i \mid i \le n\} = \sigma\{y_0, Y_{i-1}, x_i, y_i \mid i \le n\},$$

and

$$\bar{\mathcal{F}}_n \triangleq \sigma\{y_0, Y_{i-1}, y_i \mid i \le n\}.$$

The difference between the $\sigma$-algebras is the sequence $\{x_n\}$. Define the conditioned average iterate

$$g_n(y_n, x_n) \triangleq E[Y_n \mid \mathcal{F}_n],$$

and the corresponding martingale difference noise

$$\delta M_n \triangleq Y_n - E[Y_n \mid \mathcal{F}_n].$$

Thus, we can write the iteration as

$$y_{n+1} = y_n + \gamma_n \left(g_n(y_n, x_n) + \delta M_n + Z_n\right),$$

where $Z_n$ is a reflection term which forces the iterate to the nearest point in the set $H$ whenever the
iterate leaves it (see Kushner and Yin (1997) for details). Next, set

$$\bar g(y) \triangleq E[g_n(y, x_n) \mid \bar{\mathcal{F}}_n].$$

Later, we will see that the sum of the sequence $\{\delta M_n\}$ converges to 0, and the r.h.s. of the iteration
behaves approximately as the function $\bar g(y)$, which yields the corresponding ODE, i.e.,

$$\dot y = \bar g(y).$$

The following ODE method will show that the asymptotic behavior of the iteration is equal to the
asymptotic behavior of the corresponding ODE.
Define the auxiliary variable

$$t_n \triangleq \sum_{k=0}^{n-1} \gamma_k,$$

and the monotone piecewise constant auxiliary function

$$m(t) = \{n \mid t_n \le t < t_{n+1}\}.$$
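The auxiliary time scale $t_n$ and the index function $m(t)$ are easy to compute for a given step sequence. The sketch below is a small illustration with a hypothetical helper `make_m`: it accumulates the steps into $t_n$ and uses binary search to return the unique $n$ with $t_n \le t < t_{n+1}$. The step values are arbitrary examples.

```python
from bisect import bisect_right
from itertools import accumulate


def make_m(gammas):
    """Return (t, m) where t[n] = sum_{k<n} gammas[k] and m(time) = n with t[n] <= time < t[n+1]."""
    t = [0.0] + list(accumulate(gammas))  # t[0] = 0 by convention

    def m(time):
        # bisect_right counts the entries of t that are <= time
        return bisect_right(t, time) - 1

    return t, m


t, m = make_m([0.5, 0.25, 0.25])
indices = [m(0.0), m(0.6), m(0.75)]
```

Sums of the form $\sum_{i=m(jT)}^{m(jT+t)-1}$ that appear in the assumptions below therefore range over the iterations that fall into a window of length $t$ on this interpolated time axis.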

The following assumption, taken from Section 6.1 of Kushner and Yin (1997), is required to establish 
the basic Theorem. An interpretation of the assumption follows its statement. 

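Assumption C.1 Assume that

1. The coefficients $\{\gamma_n\}$ satisfy $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\lim_{n \to \infty} \gamma_n = 0$.

2. $\sup_n E[\|Y_n\|] < \infty$.

3. $g_n(y_n, x)$ is continuous in $y_n$ for each $x$ and $n$.

4. For each $\mu > 0$ and for some $T > 0$ there is a continuous function $\bar g(\cdot)$ such that for each $y$

$$\lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \max_{0 \le t \le T} \left|\sum_{i=m(jT)}^{m(jT+t)-1} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right| \ge \mu\right) = 0.$$

5. For each $\mu > 0$ and for some $T > 0$ we have

$$\lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \max_{0 \le t \le T} \left|\sum_{i=m(jT)}^{m(jT+t)-1} \gamma_i\, \delta M_i\right| \ge \mu\right) = 0.$$

6. There are measurable and non-negative functions $\rho_3(y)$ and $\rho_{n4}(x)$ such that

$$\|g_n(y_n, x)\| \le \rho_3(y_n)\, \rho_{n4}(x),$$

where $\rho_3(y)$ is bounded on each bounded $y$-set, and for each $\mu > 0$ we have

$$\lim_{\tau \to 0} \lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \sum_{i=m(j\tau)}^{m(j\tau+\tau)-1} \gamma_i\, \rho_{n4}(x_i) \ge \mu\right) = 0.$$

7. There are measurable and non-negative functions $\rho_1(y)$ and $\rho_{n2}(x)$ such that $\rho_1(y)$ is
bounded on each bounded $y$-set and

$$\|g_n(y_1, x) - g_n(y_2, x)\| \le \rho_1(y_1 - y_2)\, \rho_{n2}(x),$$

where

$$\lim_{y \to 0} \rho_1(y) = 0,$$

and

$$\Pr\left(\limsup_{j} \sum_{i=j}^{m(t_j+\tau)} \gamma_i\, \rho_{i2}(x_i) < \infty\right) = 1.$$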

The conditions of Assumption C.1 are quite general but can be interpreted as follows. Assumptions
C.1.1-C.1.3 are straightforward. Assumption C.1.4 is reminiscent of ergodicity, which is used to replace
the state-dependent function $g_n(\cdot, \cdot)$ with the state-independent function $\bar g(\cdot)$, whereas
Assumption C.1.5 states that the martingale difference noise converges to 0 in probability. Assumptions
C.1.6 and C.1.7 ensure that the function $g_n(\cdot, \cdot)$ is not unbounded and satisfies a Lipschitz condition.

The following Theorem, adapted from Kushner and Yin (1997), provides the main convergence 
result required. The remainder of this appendix shows that the required conditions in Assumption 
C.l hold. 

Theorem C.2 (Adapted from Theorem 6.1.1 in Kushner and Yin (1997)) Assume that Algorithm
1 and Assumption C.1 hold. Then $y_n$ converges to some invariant set of the projected ODE

$$\dot y = \Pi_H[\bar g(y)].$$

Thus, the remainder of this section is devoted to showing that Assumptions C.1.1-C.1.7 are satisfied.
For future purposes, we express Algorithm 1 using the augmented parameter vector $y_n$

$$y_n \triangleq \left(\theta_n' \;\; w_n' \;\; \tilde\eta_n'\right)'. \qquad (34)$$

The components of $Y_n$ are determined according to (27). The corresponding sub-vectors of $\bar g(y_n)$
will be denoted by

$$\bar g(y_n) \triangleq \left[\bar g(\theta_n)' \;\; \bar g(w_n)' \;\; \bar g(\tilde\eta_n)'\right]' \in \mathbb{R}^{K+L+1},$$

and similarly

$$g_n(y_n, x_n) \triangleq \left[g_n(\theta_n, x_n)' \;\; g_n(w_n, x_n)' \;\; g_n(\tilde\eta_n, x_n)'\right]' \in \mathbb{R}^{K+L+1}.$$






Di Castro and Meir 



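We begin by examining the components of $g_n(y_n, x_n)$ and $\bar g(y_n)$. The iterate $g_n(\tilde\eta_n, x_n)$ is

$$g_n(\tilde\eta_n, x_n) = E\left[\Gamma_\eta \left(r(x_n) - \tilde\eta_n\right) \mid \mathcal{F}_n\right] = \Gamma_\eta \left(r(x_n) - \tilde\eta_n\right),$$

and averaging over $x_n$ we have also

$$\bar g(\tilde\eta_n) = \Gamma_\eta \left(\eta(\theta_n) - \tilde\eta_n\right). \qquad (35)$$

The iterate $g_n(w_n, x_n)$ is

$$g_n(w_n, x_n) = E\left[\Gamma_w\, \tilde d(x_n, x_{n+1}, w_n)\, e_n \mid \mathcal{F}_n\right]$$
$$= E\left[\Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left(r(x_n) - \tilde\eta_n + \phi(x_{n+1})' w_n - \phi(x_n)' w_n\right) \,\middle|\, \mathcal{F}_n\right] \qquad (36)$$
$$= \Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left(r(x_n) - \tilde\eta_n + \sum_{y \in X} P(y|x_n, \theta_n)\, \phi(y)' w_n - \phi(x_n)' w_n\right),$$

and the iterate $\bar g(w_n)$ is

$$\bar g(w_n) = E\left[g_n(w_n, x_n) \mid \bar{\mathcal{F}}_n\right]$$
$$= E\left[\Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left(r(x_n) - \tilde\eta_n + \sum_{y \in X} P(y|x_n, \theta_n)\, \phi(y)' w_n - \phi(x_n)' w_n\right) \,\middle|\, \bar{\mathcal{F}}_n\right],$$

which, following Bertsekas and Tsitsiklis (1996), Section 6.3, can be written in matrix form

$$\bar g(w_n) = \Gamma_w \left[\Phi' \Pi(\theta_n) \left((1 - \lambda) \sum_{k=0}^{\infty} \lambda^k P^{k+1} - I\right) \Phi\, w_n + \Phi' \Pi(\theta_n) \sum_{k=0}^{\infty} \lambda^k P^k \left(r - \tilde\eta_n\right)\right].$$

With some further algebra we can express this using (26),

$$\bar g(w_n) = \Gamma_w \left[A(\theta_n) w_n + b(\theta_n) + G(\theta_n) \left(\eta(\theta_n) - \tilde\eta_n\right)\right].$$

Finally, the iterate $g_n(\theta_n, x_n)$ is

$$g_n(\theta_n, x_n) = E\left[\tilde d(x_n, x_{n+1}, w_n)\, \psi(x_n, u_n, \theta_n) \mid \mathcal{F}_n\right]$$
$$= E\left[d(x_n, x_{n+1}, \theta_n)\, \psi(x_n, u_n, \theta_n) \mid \mathcal{F}_n\right] + E\left[\left(\tilde d(x_n, x_{n+1}, w_n) - d(x_n, x_{n+1}, \theta_n)\right) \psi(x_n, u_n, \theta_n) \,\middle|\, \mathcal{F}_n\right] \qquad (37)$$
$$= E\left[d(x_n, x_{n+1}, \theta_n)\, \psi(x_n, u_n, \theta_n) \mid \mathcal{F}_n\right] + \sum_{z \in X} P(z|x_n)\, \psi(x_n, u_n, \theta_n) \left(\tilde d(x_n, z, w_n) - d(x_n, z, \theta_n)\right),$$

and

$$\bar g(\theta_n) = E\left[\tilde d(x_n, x_{n+1}, w_n)\, \psi(x_n, u_n, \theta_n) \mid \bar{\mathcal{F}}_n\right]$$
$$= \nabla_\theta \eta(\theta_n) + \sum_{x,y \in X} \sum_{u \in U} \pi(x)\, P(u|x, \theta_n)\, P(y|x,u)\, \psi(x,u,\theta_n) \left(\tilde d(x, y, w_n) - d(x, y, \theta_n)\right).$$

Next, we show that the required assumptions hold.

C.2 Satisfying Assumption C.1.2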

We need to show that $\sup_n E[\|Y_n\|_2] < \infty$. Since later we need to show that $\sup_n E[\|Y_n\|_2^2] < \infty$,
and the proof of the second moment is similar to the proof of the first moment, we consider both
moments here.

Lemma C.3 The sequence $\tilde\eta_n$ is bounded w.p. 1, $\sup_n E[\|Y_n(\tilde\eta_n)\|_2] < \infty$, and $\sup_n E[\|Y_n(\tilde\eta_n)\|_2^2] < \infty$.

Proof We can choose $M$ such that $\gamma_n \Gamma_\eta < 1$ for all $n > M$. Using Assumption 3.4 for the
boundedness of the rewards, we have

$$\tilde\eta_{n+1} = (1 - \gamma_n \Gamma_\eta)\, \tilde\eta_n + \gamma_n \Gamma_\eta\, r(x_n)$$
$$\le (1 - \gamma_n \Gamma_\eta)\, \tilde\eta_n + \gamma_n \Gamma_\eta B_r$$
$$\le \begin{cases} \tilde\eta_n & \text{if } \tilde\eta_n > B_r, \\ B_r & \text{if } \tilde\eta_n \le B_r, \end{cases} \qquad (38)$$
$$\le \max\{\tilde\eta_n, B_r\},$$

which means that each iterate is bounded above by the previous iterate or by a constant. We denote
this bound by $B_{\tilde\eta}$. Using similar arguments we can prove that $\tilde\eta_n$ is bounded below, and the first
part of the lemma is proved. Since $\tilde\eta_{n+1}$ is bounded the second part follows trivially. ■
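The bound (38) says that each new iterate is a convex combination of the previous iterate and a bounded reward, so it can never exceed the larger of the two. The short simulation below illustrates this. It is a sketch under stated assumptions: the function name `run_eta`, the uniform reward distribution on $[-B_r, B_r]$, and the step schedule $\gamma_n = 1/(n + \Gamma_\eta)$, chosen only so that $\gamma_n \Gamma_\eta < 1$ holds for every $n$, are all hypothetical.

```python
import random


def run_eta(eta0, B_r, Gamma_eta, steps, seed=0):
    """Iterate eta_{n+1} = (1 - g_n * Gamma_eta) * eta_n + g_n * Gamma_eta * r_n with |r_n| <= B_r."""
    rng = random.Random(seed)
    eta, path = eta0, [eta0]
    for n in range(1, steps + 1):
        gamma = 1.0 / (n + Gamma_eta)      # assumed schedule; gamma * Gamma_eta < 1 for all n
        r = rng.uniform(-B_r, B_r)         # assumed bounded reward distribution
        eta = (1.0 - gamma * Gamma_eta) * eta + gamma * Gamma_eta * r
        path.append(eta)
    return path


path = run_eta(eta0=5.0, B_r=1.0, Gamma_eta=2.0, steps=1000)
upper = max(path)   # stays at or below max(eta0, B_r) = 5.0 up to rounding
lower = min(path)   # stays at or above min(eta0, -B_r) = -1.0 up to rounding
```

Starting above $B_r$, the iterate is pulled down toward the reward range and never climbs above its starting value, which is the content of the case distinction in (38).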



Lemma C.4 We have $\sup_n E[\|Y_n(w_n)\|_2^2] < \infty$ and $\sup_n E[\|Y_n(w_n)\|_2] < \infty$.
Proof For the first part we have



E 



WYniw^m 



E 



(a) , 



< r^.E 



J^iod (x-n,, Xt^-^i , Wfi^ G 
oo 

A''(?!) (x„_fe) (r {Xn) - Vn + (f> {Xn+l)' Wn " [Xn)' Wn) 

k=0 

TO 

\\(j>{xn-k) {r {xn) - fjn + (/) (x„+i )' w„ - (/) (a;„ )' w„) 

.fe=0 

sup \\4l{Xn-k) {r {Xn) - fjn + (x„+l)' W„ - (x„)' W„) 1 1 2 ^ A' 



(1-Ar 



fe=0 

l'^(a;n-fc)ll2 + I'MI^ + Il'?^(3^n+l)ll2 ' Il'^«ll2 + H i^n)\\l ' \\Wr, 



< 



(1-Ar 



where we used the triangle inequality in (a) and the inequality $(a + b)^2 \le 2a^2 + 2b^2$ in (b). The
bound $\sup_n E[\|Y_n(w_n)\|_2] < \infty$ follows directly from the Cauchy-Schwarz inequality. ■



Lemma C.5 We have $\sup_n E[\|Y_n(\theta_n)\|_2^2] < \infty$ and $\sup_n E[\|Y_n(\theta_n)\|_2] < \infty$. The proof proceeds as
in Lemma C.4.

Based on Lemmas C.3, C.4, and C.5 we can assert Assumption C.1.2.

C.3 Satisfying Assumption C.1.3

Assumption C.1.3 requires the continuity of $g_n(y_n, x_n)$ for each $n$ and $x_n$. Again, we show that this
assumption holds for the three parts of the vector $y_n$.

Lemma C.6 The function $g_n(\tilde\eta_n, x_n)$ is a continuous function of $\tilde\eta_n$ for each $n$ and $x_n$.

Proof Since $g_n(\tilde\eta_n, x_n) = \Gamma_\eta (r(x_n) - \tilde\eta_n)$ the claim follows. ■

Lemma C.7 The function $g_n(w_n, x_n)$ is a continuous function of $\tilde\eta_n$, $w_n$, and $\theta_n$ for each $n$ and
$x_n$.

Proof The function is

$$g_n(w_n, x_n) = \Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left(r(x_n) - \tilde\eta_n + \sum_{y \in X} P(y|x_n, \theta_n)\, \phi(y)' w_n - \phi(x_n)' w_n\right).$$

The transition probability $\sum_{y \in X} P(y|x_n, \theta_n)$ can be written as $\sum_{y \in X,\, u \in U} P(y|x_n, u_n)\, \mu(u_n|x_n, \theta_n)$.
The function $\mu(u_n|x_n, \theta_n)$ is continuous in $\theta_n$ by Assumption 3.6, and thus $g_n(w_n, x_n)$ is continuous
in $\tilde\eta_n$ and $\theta_n$ and the lemma follows. ■

Lemma C.8 The function $g_n(\theta_n, x_n)$ is a continuous function of $\tilde\eta_n$, $w_n$, and $\theta_n$ for each $n$ and
$x_n$.

Proof By definition, the function $g_n(\theta_n, x_n)$ is

$$g_n(\theta_n, x_n) = E\left[\tilde d(x_n, x_{n+1}, w_n)\, \psi(x_n, u_n, \theta_n) \mid \mathcal{F}_n\right]$$
$$= \frac{\nabla_\theta \mu(u_n|x_n, \theta_n)}{\mu(u_n|x_n, \theta_n)} \left(r(x_n) - \tilde\eta_n + \sum_{y \in X} P(y|x_n, \theta_n)\, \phi(y)' w_n - \phi(x_n)' w_n\right).$$

Using similar arguments to Lemma C.7 the claim holds. ■



C.4 Satisfying Assumption C.1.4 

In this section we prove the following convergence result: for each $\mu > 0$ and for some $T > 0$ there
is a continuous function $\bar g(\cdot)$ such that for each $y$

$$\lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \max_{0 \le t \le T} \left|\sum_{i=m(jT)}^{m(jT+t)-1} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right| \ge \mu\right) = 0. \qquad (39)$$

We start by showing that there exist independent cycles of the algorithm since the underlying Markov
chain is recurrent and aperiodic. Then, we show that the cycles behave as a martingale, thus Doob's
inequality can be used. Finally we show that the sum in (39) converges to 0 w.p. 1. We start by
investigating the regenerative nature of the process.

Based on Lemma 3.2, there exists a recurrent state common to all $MC(\theta)$, denoted by $x^*$. We
define the series of hitting times of the recurrent state $x^*$ by $t_0 = 0, t_1, t_2, \ldots$, where $t_m$ is the $m$-th
time the agent hits the state $x^*$. Mathematically, we can define this series recursively by

$$t_{m+1} = \min\{n \mid x_n = x^*,\ n > t_m\}, \quad t_0 = 0,$$

and $T_m \triangleq t_{m+1} - t_m$. Define the $m$-th cycle of the algorithm to be the set of times

$$\mathcal{T}_m \triangleq \{n \mid t_{m-1} \le n < t_m\}, \qquad (40)$$

and the corresponding trajectories

$$C_m \triangleq \{x_n \mid n \in \mathcal{T}_m\}. \qquad (41)$$

Define a function, $\varrho(k)$, which returns the cycle to which the time $k$ belongs, i.e.,

$$\varrho(k) \triangleq \{m \mid k \in \mathcal{T}_m\}.$$

We notice that based on Lemma 3.3, and using the Regenerative Cycle Theorem (see Bremaud
(1999), pp. 87), the cycles $C_m$ are independent of each other.
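The hitting times and cycles defined in (40) and (41) can be computed directly from a sampled trajectory. The sketch below follows the recursive definition literally, including the convention $t_0 = 0$; the function names and the short example trajectory are hypothetical.

```python
def hitting_times(trajectory, x_star):
    """t_0 = 0 and t_{m+1} = min{n : x_n == x_star, n > t_m}."""
    times = [0]
    for n, x in enumerate(trajectory):
        if n > times[-1] and x == x_star:
            times.append(n)
    return times


def cycles(trajectory, x_star):
    """The m-th cycle is the set of times {n : t_{m-1} <= n < t_m}, for m = 1, 2, ..."""
    t = hitting_times(trajectory, x_star)
    return [list(range(t[m - 1], t[m])) for m in range(1, len(t))]


trajectory = [2, 0, 1, 0, 0, 2, 0]     # example states; x_star = 0 plays the role of x*
times = hitting_times(trajectory, 0)
time_sets = cycles(trajectory, 0)
segments = [[trajectory[n] for n in cycle] for cycle in time_sets]
```

Apart from the first segment, which starts at time 0 by convention, every segment begins at a visit to the recurrent state, which is what makes the segments regenerate independently of one another.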

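Next, we examine (39), and start by defining the following events:

$$b_n^{(1)} \triangleq \left\{\sup_{j \ge n} \max_{0 \le t \le T} \left|\sum_{i=m(jT)}^{m(jT+t)-1} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right| \ge \mu\right\},$$
$$b_n^{(2)} \triangleq \left\{\sup_{j \ge n} \sup_{k \ge m(jT)} \left|\sum_{i=m(jT)}^{k} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right| \ge \mu\right\},$$
$$b_n^{(3)} \triangleq \left\{\sup_{k \ge n} \left|\sum_{i=n}^{k} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right| \ge \mu\right\}.$$

It is easy to show that for each $n$ we have $b_n^{(1)} \subset b_n^{(2)}$, thus,

$$\Pr\left(b_n^{(1)}\right) \le \Pr\left(b_n^{(2)}\right). \qquad (42)$$

It is easy to verify that the series $\{b_n^{(2)}\}$ is a subsequence of $\{b_n^{(3)}\}$. Thus, if we prove that
$\lim_{n \to \infty} \Pr(b_n^{(3)}) = 0$, then $\lim_{n \to \infty} \Pr(b_n^{(2)}) = 0$, and using (42), Assumption C.1.4 holds.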

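Next, we examine the sum defining the event $b_n^{(3)}$, by splitting it into a sum over cycles and a sum
within each cycle. We can write it as follows

$$\sum_{i=n}^{\infty} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right) = \sum_{m=\varrho(n)}^{\infty} \sum_{i \in \mathcal{T}_m} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right).$$

Denote $c_m \triangleq \sum_{i \in \mathcal{T}_m} \gamma_i \left(g_n(y, x_i) - \bar g(y)\right)$. Therefore, by the Regenerative Cycle Theorem (Bremaud
(1999), pp. 87), $c_m$ are independent random variables. Also,

$$E[c_m] = E\left[\sum_{i \in \mathcal{T}_m} \gamma_i \left(g_i(y, x_i) - \bar g(y)\right)\right] = E\left[E\left[\sum_{i \in \mathcal{T}_m} \gamma_i \left(g_n(y, x_i) - \bar g(y)\right) \,\middle|\, T_m\right]\right].$$

We argue that $c_m$ is square integrable. To prove this we need to show that the second moments of
$T_m$ and $(g_n(y, x_i) - \bar g(y))$ are finite.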

Lemma C.9

1. The first two moments of the random times $\{T_m\}$ are bounded above by a constant $B_T$, for
all $\theta \in \mathbb{R}^K$ and for all $m$, $1 \le m \le \infty$.

2. $E\left[\left(g_n(y, x_i) - \bar g(y)\right)^2\right] \le B_g$.

3. Define $\bar\gamma_m \triangleq \sup_{k \in \mathcal{T}_m} \gamma_k$, then $\sum_{m=0}^{\infty} \bar\gamma_m^2 < \infty$.

4. $E[c_m^2] \le (B_T B_g)^2$.



Proof 



1. According to Assumption 3.2 and Lemma 3.3, each Markov chain in $\mathcal{P}$ is recurrent. Thus, for
each $\theta \in \mathbb{R}^K$ there exists a constant $B_T(\theta)$, $0 < B_T(\theta) < 1$, where for $k < |X|$ we have



P{T,n ^ k\e„,) < {Bt 



[k/\X\] 



1 < m < 00, 1 < A: < 00, 



(43) 



where $\lfloor a \rfloor$ is the largest integer which is not greater than $a$. Otherwise, if for $k > |X|$ we have
$B_T(\theta_m) = 1$ then the chain transitions equal 1, which contradicts the aperiodicity of the chains.
Therefore,



00 00 



[k/\xu 



= Sti(6',„) < 00, 



fc=i 



fc=i 



and 



k=l 



Since the set $\mathcal{P}$ is closed, by Assumption 3.2 the above holds for the closure of $\mathcal{P}$ as well. Thus,
there exists a constant $B_T$ satisfying $B_T = \max\{\sup_\theta B_{T1}(\theta), \sup_\theta B_{T2}(\theta)\} < \infty$.

2. The proof proceeds along the same lines as the proofs of Lemmas C.3, C.4, and C.5.

3. The result follows trivially since the sequence $\{\bar\gamma_m^2\}$ is a subsequence of the summable
sequence $\{\gamma_m^2\}$.

4. By definition, for large enough $m$ we have $\bar\gamma_m \le 1$. Therefore, we have



E 



< E 



^3 i9n{y,Xj) - g{y)) 



\%n\'' ( sup7j^ (^sup {gn{y,Xj) - g{y)) 



< BlBl 



Next, we conclude by showing that Assumption C.1.4 is satisfied. Define the process $d_n \triangleq \sum_{m=0}^{n} c_m$.
This process is a martingale since the sequence $\{c_m\}$ is square integrable (by Lemma C.9) and satisfies
$E[d_{m+1} \mid d_m] = d_m$. Using Doob's martingale inequality$^1$ we have



e(fe) 



E 



Pr sup V V 7i (5n {y, x^) - g {y)) > < lim 

I k>n I n— >oo 



Ejer„ 7j (5n iv, Xj) ~ g (y)) 



m—g{n) j^Tj-n 



EOO -p 
m—g{n) 



= lim 

n — >oo ^ 

oo 

< lim "flBgBT/fi^ 

771— ^(n) 

= 0. 

C.5 Satisfying Assumption C.1.5 

In this section we need to show that for each $\mu > 0$ and for some $T > 0$ we have



(EjeT™ 7j (dn (y, Xj) - g (y)) 



lim Pr sup max 



0<t<T 



>Ai =0. 



(44) 



In order to follow the same lines as in Section C.4, we need to show that the martingale difference noise, $\delta M_n$, has zero mean and a bounded second moment. By definition, $\delta M_n(\cdot)$ has zero mean.

Lemma C.10 The martingale difference noise, $\delta M_n(\cdot)$, is bounded in the second moment.

Proof The claim is immediate from the fact that

$$E\left[\|Y_n - g_n(y_n, x_n)\|^2\right] \le 2E\left[\|Y_n\|^2 + \|g_n(y_n, x_n)\|^2\right],$$

and from Lemma C.3, Lemma C.4, and Lemma C.5.

Combining this fact with Lemma C.10, and applying the regenerative decomposition of Section C.4, we conclude that statistically $\delta M_n(\cdot)$ behaves exactly as $(g_n(y, x_i) - \bar{g}(y))$ of Section C.4 and thus (44) holds.



C.6 Satisfying Assumption C.1.6 

In this section we need to prove that there are non-negative measurable functions $\rho_3(y)$ and $\rho_{n4}(x)$ such that

$$\|g_n(y_n, x)\| \le \rho_3(y_n)\,\rho_{n4}(x), \qquad (45)$$

where $\rho_3(y)$ is bounded on each bounded $y$-set, and for each $\mu > 0$ we have

$$\lim_{\tau \to 0} \lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \sum_{i=m(j\tau)}^{m(j\tau+\tau)-1} \gamma_i\,\rho_{n4}(x_i) \ge \mu\right) = 0.$$

The following lemma states a stronger condition for Assumption C.1.6. In fact, we choose $\rho_3(y)$ to be a positive constant.

Lemma C.11 If $\|g_n(y, x)\|$ is uniformly bounded for each $y$, $x$ and $n$, then Assumption C.1.6 is satisfied.



1. If $w_n$ is a martingale sequence then $\Pr\left(\sup_{n \ge 0} |w_n| \ge \mu\right) \le \lim_{n \to \infty} E\left[|w_n|^2\right] / \mu^2$.
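Proof Let us denote the upper bound by the random variable $B$, i.e.,

$$\|g_n(y, x)\| \le B, \quad \text{w.p. } 1.$$

Thus

$$\begin{aligned}
\lim_{\tau \to 0} \lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \sum_{i=m(j\tau)}^{m(j\tau+\tau)-1} \gamma_i\,\rho_{n4}(x_i) \ge \mu\right)
&\le \lim_{\tau \to 0} \lim_{n \to \infty} \Pr\left(\sup_{j \ge n} \sum_{i=m(j\tau)}^{m(j\tau+\tau)-1} \gamma_i B \ge \mu\right) \\
&= \lim_{\tau \to 0} \lim_{n \to \infty} \Pr\left(\sup_{j \ge n} B \sum_{i=m(j\tau)}^{m(j\tau+\tau)-1} \gamma_i \ge \mu\right) \\
&\le \lim_{\tau \to 0} \Pr\left(B\tau \ge \mu\right) \\
&= 0.
\end{aligned}$$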



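Based on Lemma C.11, we are left with proving that $g_n(y, x)$ is uniformly bounded. The following lemma states so.

Lemma C.12 The function $g_n(y, x)$ is uniformly bounded for all $n$.

Proof We examine the components of $g_n(y_n, x_n)$. In (35) we showed that

$$g_n(\tilde{\eta}_n, x_n) = \Gamma_\eta \left(r(x_n) - \tilde{\eta}_n\right).$$

Since both $r(x_n)$ and $\tilde{\eta}_n$ are bounded by Assumption 3.1 and Lemma C.3 respectively, we have a uniform bound on $g_n(\tilde{\eta}_n, x_n)$. Recalling (36) we have

$$\begin{aligned}
\|g_n(w_n, x_n)\| &= \left\|\Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left(r(x_n) - \tilde{\eta}_n + \sum_{y \in X} P(y \mid x_n, \theta_n)\,\phi(y)' w_n - \phi(x_n)' w_n\right)\right\| \\
&\le \Gamma_w \frac{1}{1-\lambda} B_\phi \left(B_r + B_{\tilde{\eta}} + 2B_\phi B_w\right).
\end{aligned}$$

Finally, recalling (37) we have

$$\|g_n(\theta_n, x_n)\| = \left\|E\left[\tilde{d}(x_n, x_{n+1}, w_n)\,\psi(x_n, u_n, \theta_n) \mid \mathcal{F}_n\right]\right\| \le \left(B_r + B_{\tilde{\eta}} + 2B_\phi B_w\right) B_\psi.$$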
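The bound on the critic component can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it uses a sampled temporal difference (the next feature vector replaces the expectation over $y$), synthetic bounded features and rewards, and hypothetical names (`critic_update_norms`, `step_w` for $\Gamma_w$, `eta_est` for $\tilde{\eta}$). It checks that the update norm never exceeds $\Gamma_w \frac{1}{1-\lambda} B_\phi (B_r + B_{\tilde{\eta}} + 2B_\phi B_w)$.

```python
import numpy as np

def critic_update_norms(features, rewards, eta_est, w, lam, step_w):
    """Norms of step_w * d_n * z_n along one trajectory, where z_n is the
    eligibility trace and d_n the sampled temporal difference."""
    z = np.zeros(features.shape[1])
    norms = []
    for n in range(len(features) - 1):
        z = lam * z + features[n]                      # z_n = sum_k lam^k phi(x_{n-k})
        d = rewards[n] - eta_est + features[n + 1] @ w - features[n] @ w
        norms.append(np.linalg.norm(step_w * d * z))
    return np.array(norms)

rng = np.random.default_rng(1)
lam, step_w, eta_est = 0.8, 0.5, 0.3
features = rng.uniform(-1.0, 1.0, size=(5000, 3))      # rows are phi(x_n)
rewards = rng.uniform(-1.0, 1.0, size=5000)
w = rng.uniform(-2.0, 2.0, size=3)

# Empirical counterparts of the constants B_phi, B_r, B_eta and B_w.
B_phi = np.linalg.norm(features, axis=1).max()
B_r, B_eta, B_w = np.abs(rewards).max(), abs(eta_est), np.linalg.norm(w)
bound = step_w * B_phi * (B_r + B_eta + 2.0 * B_phi * B_w) / (1.0 - lam)
worst_update = critic_update_norms(features, rewards, eta_est, w, lam, step_w).max()
```

The factor $1/(1-\lambda)$ comes entirely from the geometric sum defining the eligibility trace; the remaining factor bounds the temporal difference itself.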



C.7 Satisfying Assumption C.1.7 

In this section we show that there are non-negative measurable functions $\rho_1(y)$ and $\rho_{n2}(x)$ such that $\rho_1(y)$ is bounded on each bounded $y$-set and

$$\|g_n(y_1, x) - g_n(y_2, x)\| \le \rho_1(y_1 - y_2)\,\rho_{n2}(x), \qquad (46)$$

where

$$\lim_{y \to 0} \rho_1(y) = 0, \qquad (47)$$

and for some $\tau > 0$

$$\Pr\left(\limsup_{j} \sum_{i=j}^{m(t_j+\tau)} \gamma_i\,\rho_{i2}(x_i) < \infty\right) = 1.$$

From Section C.6 we infer that we can choose $\rho_{n2}(x)$ to be a constant since $g_n(y, x)$ is uniformly bounded. Thus, we need to show the appropriate $\rho_1(\cdot)$ function. The following lemma shows it.

Lemma C.13 The following functions satisfy (46) and (47).

1. The function $\rho_1(y) = \|\tilde{\eta}_2 - \tilde{\eta}_1\|$ and $\rho_{n2}(x) = \Gamma_\eta$ for $g_n(\tilde{\eta}, x)$.

2. The function $\rho_1(y) = \frac{1}{1-\lambda} B_\phi^2 \left(\sum_{y \in X} B_w \|P(y \mid x, \theta_1) - P(y \mid x, \theta_2)\| + \|w_1 - w_2\|\right)$ and $\rho_{n2}(x) = \Gamma_w$ for $g_n(w, x)$.

3. The function $\rho_1(y) = \sum_{y \in X} B_w \|P(y \mid x, \theta_1) - P(y \mid x, \theta_2)\| \cdot B_\psi$ and $\rho_{n2}(x) = 1$ for $g_n(\theta, x)$.



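Proof

1. Recalling (35) we have for $g_n(\tilde{\eta}, x)$

$$\|g_n(\tilde{\eta}_1, x) - g_n(\tilde{\eta}_2, x)\| \le \Gamma_\eta \|\tilde{\eta}_2 - \tilde{\eta}_1\|,$$

thus (46) and (47) are satisfied for 1.

2. Recalling (36) we have for $g_n(w, x)$

$$\begin{aligned}
\|g_n(w_1, x) - g_n(w_2, x)\|
&\le \left\|\Gamma_w \sum_{k=0}^{\infty} \lambda^k \phi(x_{n-k}) \left[\left(\sum_{y \in X} P(y \mid x, \theta_1)\,\phi(y)' w_1 - \phi(x_n)' w_1\right) - \left(\sum_{y \in X} P(y \mid x, \theta_2)\,\phi(y)' w_2 - \phi(x_n)' w_2\right)\right]\right\| \\
&\le \frac{\Gamma_w}{1-\lambda} B_\phi^2 \left(\sum_{y \in X} \|P(y \mid x, \theta_1) w_1 - P(y \mid x, \theta_2) w_2\| + \|w_1 - w_2\|\right) \\
&\le \frac{\Gamma_w}{1-\lambda} B_\phi^2 \left(\sum_{y \in X} B_w \|P(y \mid x, \theta_1) - P(y \mid x, \theta_2)\| + \|w_1 - w_2\|\right).
\end{aligned}$$

Trivially, with respect to $w$, (46) and (47) are satisfied. Regarding $\theta$, (46) and (47) are satisfied if we recall the definition of $P(y \mid x, \theta)$ from (1) and the continuity of $\mu(u \mid x, \theta)$ from Assumption 3.4.

3. Recalling (37) we have for $g_n(\theta, x)$

$$\begin{aligned}
\|g_n(\theta_1, x) - g_n(\theta_2, x)\| &= \left\|E\left[\tilde{d}(x, y, w_1)\,\psi(x, u, \theta_1) \mid \mathcal{F}_n\right] - E\left[\tilde{d}(x, y, w_2)\,\psi(x, u, \theta_2) \mid \mathcal{F}_n\right]\right\| \\
&\le \sum_{y \in X} \|P(y \mid x, \theta_1) - P(y \mid x, \theta_2)\| \cdot B_\psi.
\end{aligned}$$

Using similar arguments to 2, (46) and (47) are satisfied for $\theta$.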
Appendix D. Proof of Theorem 4.7

In this section we find conditions under which Algorithm 1 converges to a neighborhood of a local maximum. More precisely, we show that $\liminf_{t \to \infty} \|\nabla \eta(\theta(t))\|_2 \le \epsilon_{app} + \epsilon_{dyn}$, where the approximation error, $\epsilon_{app}$, measures the error inherent in the critic's representation, and $\epsilon_{dyn}$ is an error related to the single time scale algorithm. We note that the approximation error depends on the basis functions chosen for the critic, and in general can be reduced only by choosing a better representation basis. The term $\epsilon_{dyn}$ is the dynamic error, and this error can be reduced by choosing the critic's parameters $\Gamma_\eta$ and $\Gamma_w$ appropriately.

We begin by establishing a variant of Lyapunov's theorem for asymptotic stability$^2$, where instead of proving asymptotic convergence to a point, we prove convergence to a compact invariant set. Based on this result, we continue by establishing a bound on a time dependent ODE of the first order. This result is used to bound the critic's error in estimating the average reward per stage and the differential values. Finally, using these results, we establish Theorem 4.7.

We denote a closed ball of radius $y$ in some normed vector space, $(\mathbb{R}^L, \|\cdot\|_2)$, by $B_y$, and its surface by $\partial B_y$. Also, we denote by $A \setminus B$ a set which contains all the members of set $A$ which are not members of $B$. Finally, we define the complement of $B_y$ by $B_y^c = \mathbb{R}^L \setminus B_y$.

The following lemma is similar to Lyapunov's classic theorem for asymptotic stability (Khalil (2002), Theorem 4.1). The main difference is that when the value of the Lyapunov function is unknown inside a ball, convergence can be established to the ball, rather than to a single point.

Lemma D.1 Consider a dynamical system, $\dot{x} = f(x)$, in a normed vector space, $(\mathbb{R}^L, \|\cdot\|)$, and a closed ball $B_r \triangleq \{x \mid x \in \mathbb{R}^L, \|x\| \le r\}$. Suppose that there exists a continuously differentiable scalar function $V(x)$ such that $V(x) > 0$ and $\dot{V}(x) < 0$ for all $x \in B_r^c$, and $V(x) = 0$ for $x \in \partial B_r$. Then,

$$\limsup_{t \to \infty} \|x(t)\| \le r.$$

Proof We prove two complementary cases. In the first case, we assume that $x(t)$ never enters $B_r$. On the set $B_r^c$, $V(x)$ is a strictly decreasing function in $t$, and it is bounded below, thus it converges. We denote this bound by $C$, and notice that $C \ge 0$ since for $x \in B_r^c$, $V(x) > 0$. We prove that $C = 0$ by contradiction. Assume that $C > 0$. Then, $x(t)$ converges to the invariant set $S_C \triangleq \{x \mid V(x) = C, x \in B_r^c\}$. For each $x(t) \in S_C$ we have $\dot{V}(x) < 0$. Thus, $V(x)$ continues to decrease, which contradicts the boundedness from below. As a result, $V(x(t)) \to 0$.

In the second case, let us suppose that at some time, denoted by $t_0$, $x(t_0) \in B_r$. We argue that the trajectory never leaves $B_r$. Let us assume that at some time $t_2$, the trajectory $x(t)$ enters the set $\partial B_{r+\epsilon}$. Then on this set, we have $V(x(t_2)) > 0$. By the continuity of the trajectory $x(t)$, the trajectory must go through the set $\partial B_r$. Denote the hitting time of this set by $t_1$. By definition we have $V(x(t_1)) = 0$. Without loss of generality, we assume that the trajectory in the times $t_1 < t \le t_2$ is restricted to the set $B_{r+\epsilon} \setminus B_r$. Thus, since $\dot{V}(x(t)) < 0$ for $x \in B_{r+\epsilon} \setminus B_r$ we have

$$V(x(t_2)) = V(x(t_1)) + \int_{t_1}^{t_2} \dot{V}(x(t))\,dt < V(x(t_1)),$$

which contradicts the fact that $V(x(t_2)) > V(x(t_1))$. Since this argument holds for all $\epsilon > 0$, the trajectory $x(t)$ never leaves $B_r$.

2. We say that the equilibrium point $x = 0$ of the system $\dot{x} = f(x)$ is stable if for each $\epsilon > 0$ there exists a $\delta > 0$ such that $\|x(0)\| < \delta \Rightarrow \|x(t)\| < \epsilon$ for all $t \ge 0$. We say that the point $x = 0$ is asymptotically stable if it is stable and there exists a $\delta > 0$ such that $\|x(0)\| < \delta$ implies $\lim_{t \to \infty} x(t) = 0$ (see Khalil (2002) for more details).

The following lemma will be applied later to the linear equations (27), and more specifically, to the ODEs describing the dynamics of $\tilde{\eta}$ and $w$. It bounds the difference between an ODE's state variables and some time dependent functions.
Lemma D.2 Consider the following ODE in a normed space $(\mathbb{R}^L, \|\cdot\|_2)$

$$\frac{d}{dt} X(t) = M(t)\left(X(t) - F_1(t)\right) + F_2(t), \qquad X(0) = X_0, \qquad (48)$$

where for sufficiently large $t$:

1. $M(t) \in \mathbb{R}^{L \times L}$ is a continuous matrix which satisfies $\max_{\|x\|_2 = 1} x' M(t) x \le -\gamma < 0$ for $t \in \mathbb{R}$,

2. $F_1(t) \in \mathbb{R}^L$ satisfies $\|dF_1(t)/dt\|_2 \le B_{F1}$,

3. $F_2(t) \in \mathbb{R}^L$ satisfies $\|F_2(t)\|_2 \le B_{F2}$.

Then, the solution of the ODE satisfies $\limsup_{t \to \infty} \|X(t) - F_1(t)\|_2 \le (B_{F1} + B_{F2})/\gamma$.

Proof We express (48) as

$$\frac{d}{dt}\left(X(t) - F_1(t)\right) = M(t)\left(X(t) - F_1(t)\right) - \frac{d}{dt}F_1(t) + F_2(t), \qquad (49)$$

and define

$$Z(t) \triangleq X(t) - F_1(t), \qquad G(t) \triangleq -\frac{d}{dt}F_1(t) + F_2(t).$$

Therefore, (49) can be written as

$$\dot{Z}(t) = M(t) Z(t) + G(t),$$

where $\|G(t)\| \le B_G \triangleq B_{F1} + B_{F2}$. In view of Lemma D.1, we consider the function

$$V(Z) = \frac{1}{2}\left(\|Z(t)\|_2^2 - B_G^2/\gamma^2\right).$$

Let $B_r$ be a ball with a radius $r = B_G/\gamma$. Thus we have $V(Z) > 0$ for $Z \in B_r^c$ and $V(Z) = 0$ for $Z \in \partial B_r$. In order to satisfy the assumptions of Lemma D.1 the condition that $\dot{V}(Z) < 0$ needs to be verified. For $\|Z(t)\|_2 > B_G/\gamma$ we have

$$\begin{aligned}
\dot{V}(Z) &= (\nabla_Z V)' \dot{Z}(t) \\
&= Z(t)' M(t) Z(t) + Z(t)' G(t) \\
&\le \|Z(t)\|_2^2 \max_{\|Y(t)\|_2 = 1} Y(t)' M(t) Y(t) + \|Z(t)\|_2 \|G(t)\|_2 \\
&\le \|Z(t)\|_2 \left(-\gamma \|Z(t)\|_2 + B_G\right) \\
&< 0.
\end{aligned}$$

As a result, the assumptions of Lemma D.1 are valid and the Lemma is proved. ■
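The tracking bound of Lemma D.2 can be checked on a small example. The following sketch is illustrative only: the matrix $M(t)$, the signals $F_1$ and $F_2$, and the fixed-step Runge-Kutta integrator are all hypothetical choices, picked so that $x'M(t)x = -\gamma\|x\|^2$ with $\gamma = 2$, $B_{F1} = \sqrt{5}$, and $B_{F2} = 0.5$. It verifies that, once the transient has decayed, $\|X(t) - F_1(t)\|_2$ stays below $(B_{F1} + B_{F2})/\gamma$.

```python
import numpy as np

GAMMA = 2.0  # x' M(t) x = -GAMMA * ||x||^2 for the M(t) below

def M(t):
    s = 1.0 + np.sin(t)
    return np.array([[-GAMMA, s], [-s, -GAMMA]])    # the skew part does not affect x'Mx

def F1(t):
    return np.array([np.sin(t), np.cos(2.0 * t)])    # ||dF1/dt||_2 <= sqrt(5)

def F2(t):
    return 0.5 * np.array([np.cos(3.0 * t), np.sin(3.0 * t)])  # ||F2||_2 = 0.5

def rhs(t, x):
    return M(t) @ (x - F1(t)) + F2(t)

def simulate(x0, t_end, dt):
    """Classical fourth-order Runge-Kutta; returns times and ||X(t) - F1(t)||."""
    ts = np.arange(0.0, t_end, dt)
    errs = np.empty(len(ts))
    x = np.array(x0, dtype=float)
    for i, t in enumerate(ts):
        errs[i] = np.linalg.norm(x - F1(t))
        k1 = rhs(t, x)
        k2 = rhs(t + dt / 2, x + dt / 2 * k1)
        k3 = rhs(t + dt / 2, x + dt / 2 * k2)
        k4 = rhs(t + dt, x + dt * k3)
        x = x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return ts, errs

B_F1, B_F2 = np.sqrt(5.0), 0.5
bound = (B_F1 + B_F2) / GAMMA
ts, errs = simulate([10.0, -10.0], t_end=30.0, dt=2e-3)
late_error = errs[ts > 15.0].max()                   # error after the transient
```

The initial error is far outside the ball of radius `bound`, yet the trajectory is pulled into it, which mirrors the role of Lemma D.1 in the proof.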

The following lemma shows that the matrix $A(\theta)$, defined in (26), satisfies the conditions of Lemma D.2. For the following lemmas, we define the weighted norm $\|w\|_{\Pi(\theta)} \triangleq \sqrt{w' \Pi(\theta) w}$.

Lemma D.3 The following inequalities hold:

1. For any $w \in \mathbb{R}^{|X|}$ and for all $\theta \in \mathbb{R}^K$, $\|P(\theta) w\|_{\Pi(\theta)} < \|w\|_{\Pi(\theta)}$.

2. The matrix $M(\theta)$ satisfies $\|M(\theta) w\|_{\Pi(\theta)} < \|w\|_{\Pi(\theta)}$ for all $\theta \in \mathbb{R}^K$ and $w \in \mathbb{R}^{|X|}$.

3. The matrix $\Pi(\theta)(M(\theta) - I)$ satisfies $x' \Pi(\theta)(M(\theta) - I) x < 0$ for all $x \in \mathbb{R}^{|X|}$ and for all $\theta \in \mathbb{R}^K$.

4. There exists a positive scalar $\gamma$ such that $w' A(\theta) w \le -\gamma$ for all $w'w = 1$.

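Proof The following proof is similar in many aspects to the proof of Lemma 6.6 of Bertsekas and Tsitsiklis (1996).

1. By using Jensen's inequality for the function $f(\alpha) = \alpha^2$ we have

$$\left(\sum_{y \in X} P(y \mid x, \theta)\,w(y)\right)^2 \le \sum_{y \in X} P(y \mid x, \theta)\,w(y)^2, \qquad \forall x \in X. \qquad (50)$$

If in Jensen's inequality we have a strictly convex function and non-degenerate probability measures then the inequality is strict. The function $f(\alpha)$ is strictly convex, and by Assumption 3.2 the matrix $P(\theta)$ is aperiodic, which implies that the matrix $P(\theta)$ is not a permutation matrix. As a result, there exists $x_0 \in X$ such that the probability measure $P(y \mid x_0, \theta)$ is not degenerate, thus, the inequality in (50) is strict, i.e.,

$$\left(\sum_{y \in X} P(y \mid x_0, \theta)\,w(y)\right)^2 < \sum_{y \in X} P(y \mid x_0, \theta)\,w(y)^2. \qquad (51)$$

Then, we have

$$\begin{aligned}
\|P(\theta) w\|_{\Pi(\theta)}^2 &= w' P(\theta)' \Pi(\theta) P(\theta) w \\
&= \sum_{x \in X} \pi(x \mid \theta) \left(\sum_{y \in X} P(y \mid x, \theta)\,w(y)\right)^2 \\
&< \sum_{x \in X} \pi(x \mid \theta) \sum_{y \in X} P(y \mid x, \theta)\,w(y)^2 \\
&= \sum_{y \in X} w(y)^2 \sum_{x \in X} \pi(x \mid \theta)\,P(y \mid x, \theta) \\
&= \sum_{y \in X} w(y)^2\,\pi(y \mid \theta) \\
&= \|w\|_{\Pi(\theta)}^2,
\end{aligned}$$

where in the inequality we have used (51).

2. Using the triangle inequality and 1 we have

$$\begin{aligned}
\|M(\theta) w\|_{\Pi(\theta)} &= \left\|(1-\lambda) \sum_{m=0}^{\infty} \lambda^m P(\theta)^{m+1} w\right\|_{\Pi(\theta)} \\
&\le (1-\lambda) \sum_{m=0}^{\infty} \lambda^m \left\|P(\theta)^{m+1} w\right\|_{\Pi(\theta)} \\
&< (1-\lambda) \sum_{m=0}^{\infty} \lambda^m \|w\|_{\Pi(\theta)} \\
&= \|w\|_{\Pi(\theta)}.
\end{aligned}$$

3. By definition

$$\begin{aligned}
x' \Pi(\theta) M(\theta) x &= x' \Pi(\theta)^{1/2}\,\Pi(\theta)^{1/2} M(\theta) x \\
&\le \left\|\Pi(\theta)^{1/2} x\right\| \cdot \left\|\Pi(\theta)^{1/2} M(\theta) x\right\| \\
&= \|x\|_{\Pi(\theta)} \|M(\theta) x\|_{\Pi(\theta)} \\
&< \|x\|_{\Pi(\theta)} \|x\|_{\Pi(\theta)} \\
&= x' \Pi(\theta) x,
\end{aligned}$$

where in the first inequality we have used the Cauchy-Schwarz inequality, and in the second inequality we have used 2. Thus, $x' \Pi(\theta)(M(\theta) - I) x < 0$ for all $x \in \mathbb{R}^{|X|}$, which implies that $\Pi(\theta)(M(\theta) - I)$ is a negative definite (ND) matrix$^3$.

4. From 3, we know that for all $\theta \in \mathbb{R}^K$ and all $w \in \mathbb{R}^{|X|}$ satisfying $w'w = 1$, we have $w' \Pi(\theta)(M(\theta) - I) w < 0$, and by Assumption 3.2, this is true also for the closure of $\{\Pi(\theta)(M(\theta) - I) \mid \theta \in \mathbb{R}^K\}$. Thus, there exists a positive scalar, $\gamma'$, satisfying

$$w' \Pi(\theta)(M(\theta) - I) w \le -\gamma' < 0.$$

By Assumption 4.3 the rank of the matrix $\Phi$ is full, thus there exists a scalar $\gamma$ such that for all $w \in \mathbb{R}^L$, where $w'w = 1$, we have $w' A(\theta) w \le -\gamma < 0$. ■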
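The inequalities of Lemma D.3 can be probed numerically. The sketch below is a sanity check under stated assumptions, not part of the proof: it uses a hypothetical strictly positive 6-state transition matrix, a random feature matrix `Phi` whose column span (almost surely) excludes the constant vector, and the closed form $M = (1-\lambda) P (I - \lambda P)^{-1}$ of the geometric series. It reports the worst observed norm ratios and the scalar $\gamma$ obtained from the symmetric part of $A$.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to one."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

rng = np.random.default_rng(2)
n_states, n_features, lam = 6, 3, 0.7
P = rng.uniform(0.05, 1.0, size=(n_states, n_states))
P /= P.sum(axis=1, keepdims=True)                   # ergodic, non-degenerate rows
pi = stationary_distribution(P)
Pi = np.diag(pi)

I = np.eye(n_states)
M = (1.0 - lam) * P @ np.linalg.inv(I - lam * P)    # (1 - lam) * sum_m lam^m P^(m+1)
Phi = rng.standard_normal((n_states, n_features))   # assumed full column rank
A = Phi.T @ Pi @ (M - I) @ Phi

def pi_norm(v):
    return np.sqrt(v @ Pi @ v)                      # weighted norm ||v||_Pi

W = rng.standard_normal((1000, n_states))
ratio_P = max(pi_norm(P @ w) / pi_norm(w) for w in W)
ratio_M = max(pi_norm(M @ w) / pi_norm(w) for w in W)

# w'Aw depends only on the symmetric part of A, so the largest eigenvalue of
# (A + A')/2 gives the best constant in w'Aw <= -gamma for w'w = 1.
gamma = -np.linalg.eigvalsh(0.5 * (A + A.T)).max()
```

Both ratios stay below one on the sampled vectors, and `gamma` comes out positive for this choice of `Phi`; a feature matrix whose span contains the constant vector would need separate treatment.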

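The following Lemma establishes the boundedness of $\dot{\theta}$.

Lemma D.4 There exists a constant $B_{\theta 1} \triangleq B_{\eta 1} + B_\psi \left(B_D + B_r + B_{\tilde{\eta}} + 2B_\phi B_w\right)$ such that $\|\dot{\theta}\|_2 \le B_{\theta 1}$.

Proof Recalling (27),

$$\begin{aligned}
\|\dot{\theta}\|_2 &= \left\|\nabla_\theta \eta(\theta) + \sum_{x,y \in X \times X,\, u \in U} D^{(x,u,y)}(\theta)\left(d(x, y, \theta) - \tilde{d}(x, y, w)\right)\right\|_2 \\
&\le B_{\eta 1} + \sum_{x,y \in X \times X,\, u \in U} \left\|D^{(x,u,y)}(\theta)\right\|_2 \left|d(x, y, \theta) - \tilde{d}(x, y, w)\right| \\
&\le B_{\eta 1} + B_\psi \left(B_D + B_r + B_{\tilde{\eta}} + 2B_\phi B_w\right) \\
&= B_{\theta 1}.
\end{aligned}$$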



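Based on Lemma D.4, the following Lemma shows the boundedness of $(\eta(\theta(t)) - \tilde{\eta})$.

Lemma D.5 We have

$$\limsup_{t \to \infty} |\eta(\theta(t)) - \tilde{\eta}| \le \frac{B_{\Delta\eta}}{\Gamma_\eta},$$

where $B_{\Delta\eta} \triangleq B_{\eta 1} B_{\theta 1}$.

Proof Using the Cauchy-Schwarz inequality we have

$$|\dot{\eta}(\theta)| = |\nabla \eta(\theta)' \dot{\theta}| \le \|\nabla \eta(\theta)\|_2 \|\dot{\theta}\|_2 \le B_{\eta 1} B_{\theta 1}. \qquad (52)$$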



3. Usually, a ND matrix is defined for Hermitian matrices, i.e., if $B$ is a Hermitian matrix and it satisfies $x'Bx < 0$ for all $x \in \mathbb{C}^n$ then $B$ is a ND matrix. We use here a different definition which states that a square matrix $B$ is a ND matrix if it is real and it satisfies $x'Bx < 0$ for all $x \in \mathbb{R}^n$ (see Horn and Johnson (1985) p. 399).
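Recalling the equation for $\tilde{\eta}$ in (27) we have

$$\dot{\tilde{\eta}} = \Gamma_\eta \left(\eta(\theta) - \tilde{\eta}\right).$$

We conclude by applying Lemma D.2 and using (52) that

$$\limsup_{t \to \infty} |\eta(\theta(t)) - \tilde{\eta}| \le \frac{B_{\eta 1} B_{\theta 1}}{\Gamma_\eta} = \frac{B_{\Delta\eta}}{\Gamma_\eta}. \qquad (53)$$

In (53) we see that the bound on $|\eta(\theta) - \tilde{\eta}|$ is controlled by $\Gamma_\eta$, where larger values of $\Gamma_\eta$ ensure smaller values of $|\eta(\theta) - \tilde{\eta}|$. Next, we bound $\|w^*(\theta) - w\|_2$. We recall the second equation of (27)

$$\begin{aligned}
\dot{w} &= \Psi_w\left[\Gamma_w \left(A(\theta) w + b(\theta) + G(\theta)\left(\eta(\theta) - \tilde{\eta}\right)\right)\right], \\
A(\theta) &= \Phi' \Pi(\theta) \left(M(\theta) - I\right) \Phi, \\
M(\theta) &= (1-\lambda) \sum_{m=0}^{\infty} \lambda^m P(\theta)^{m+1}, \\
b(\theta) &= \Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m \left(r - \eta(\theta)\right), \\
G(\theta) &= \Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m.
\end{aligned}$$

We can write the equation for $w$ as

$$\dot{w} = \Psi_w\left[\Gamma_w \left(A(\theta)\left(w - w^*(\theta)\right) + G(\theta)\left(\eta(\theta) - \tilde{\eta}\right)\right)\right],$$

where $w^* = -A(\theta)^{-1} b(\theta)$. In order to use Lemma D.2, we need to demonstrate the boundedness of $\|\dot{w}^*\|$. The following lemma does so.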



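Lemma D.6

1. There exists a positive constant, $B_b \triangleq \frac{1}{1-\lambda} |X|^3 L B_\phi B_r$, such that $\|b(\theta)\|_2 \le B_b$.

2. There exists a positive constant, $B_G \triangleq \frac{1}{1-\lambda} L B_\phi$, such that $\|G(\theta)\|_2 \le B_G$.

3. There exist positive constants, $B \triangleq B_{\pi 1}(B_r + B_\eta) B_{\theta 1} + B_{P1}(B_r + B_\eta) B_{\theta 1} + B_{\eta 1} B_{\theta 1}$ and $B_{b1} \triangleq \frac{1}{1-\lambda} |X|^3 B_\phi B_r B$, such that we have $\|\dot{b}(\theta)\|_2 \le B_{b1}$.

4. There exist constants $b_A$ and $B_A$ such that

$$0 < b_A \le \|A(\theta)\|_2 \le B_A.$$

5. There exists a constant $B_{A1}$ such that

$$\|\dot{A}(\theta)\|_2 \le B_{A1}.$$

6. We have

$$\left\|\frac{d}{dt}\left(A(\theta)^{-1}\right)\right\|_2 \le b_A^{-2} B_{A1}.$$

7. There exists a positive constant, $B_{w1}$, such that

$$\left\|\frac{d}{dt} w^*\right\|_2 \le B_{w1}.$$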

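Proof

1. We show that the entries of the vector $b(\theta)$ are uniformly bounded in $\theta$, therefore, its norm is uniformly bounded in $\theta$. Let us look at the $i$-th entry of the vector $b(\theta)$ (we denote by $[\cdot]_j$ the $j$-th row of a matrix or a vector)

$$\begin{aligned}
\left|[b(\theta)]_i\right| &= \left|\left[\Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m \left(r - \eta(\theta)\right)\right]_i\right| \\
&\le \sum_{m=0}^{\infty} \lambda^m \left|\left[\Phi' \Pi(\theta) P(\theta)^m \left(r - \eta(\theta)\right)\right]_i\right| \\
&= \sum_{m=0}^{\infty} \lambda^m \left|\sum_{l=1}^{|X|} \sum_{j=1}^{|X|} \sum_{k=1}^{|X|} [\Phi']_{il}\,\Pi_{lj}(\theta) \left[P(\theta)^m\right]_{jk} \left(r_k - \eta(\theta)\right)\right| \\
&\le \frac{1}{1-\lambda} |X|^3 B_\phi B_r,
\end{aligned}$$

thus $\|b(\theta)\|_2 \le \frac{1}{1-\lambda} |X|^3 L B_\phi B_r$ is uniformly bounded in $\theta$.

2. The proof follows by an argument similar to that of item 1.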

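3. Similarly to item 1, we show that the entries of the vector $\dot{b}(\theta)$ are uniformly bounded in $\theta$, therefore, its norm is uniformly bounded in $\theta$. First, we show that the following function of $\theta(t)$ is bounded.

$$\begin{aligned}
\left|\frac{d}{dt}\left(\Pi_{lj}(\theta) \left[P(\theta)^m\right]_{jk} \left(r_k - \eta(\theta)\right)\right)\right|
&= \left|\nabla_\theta\left(\Pi_{lj}(\theta) \left[P(\theta)^m\right]_{jk} \left(r_k - \eta(\theta)\right)\right)' \dot{\theta}\right| \\
&\le \left|\nabla_\theta \Pi_{lj}(\theta)' \dot{\theta}\, \left[P(\theta)^m\right]_{jk} \left(r_k - \eta(\theta)\right)\right| \\
&\quad + \left|\Pi_{lj}(\theta)\, \nabla_\theta \left[P(\theta)^m\right]_{jk}' \dot{\theta}\, \left(r_k - \eta(\theta)\right)\right| \\
&\quad + \left|\Pi_{lj}(\theta) \left[P(\theta)^m\right]_{jk} \nabla_\theta \eta(\theta)' \dot{\theta}\right| \\
&\le B_{\pi 1}(B_r + B_\eta) \cdot B_{\theta 1} + B_{P1}(B_r + B_\eta) \cdot B_{\theta 1} + B_{\eta 1} \cdot B_{\theta 1} \\
&= B,
\end{aligned}$$

where we used the triangle and Cauchy-Schwarz inequalities in the first and second inequalities respectively, and Lemmas 3.6 and D.4 in the second inequality. Thus,

$$\begin{aligned}
\left|[\dot{b}(\theta)]_i\right| &= \left|\frac{d}{dt}\left[\Phi' \Pi(\theta) \sum_{m=0}^{\infty} \lambda^m P(\theta)^m \left(r - \eta(\theta)\right)\right]_i\right| \\
&\le \sum_{m=0}^{\infty} \lambda^m \left|\frac{d}{dt}\left[\Phi' \Pi(\theta) P(\theta)^m \left(r - \eta(\theta)\right)\right]_i\right| \\
&\le \frac{1}{1-\lambda} \sum_{l=1}^{|X|} \sum_{j=1}^{|X|} \sum_{k=1}^{|X|} B_\phi B \\
&\le B_{b1}.
\end{aligned}$$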

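4. Since $A(\theta)$ satisfies $y' A(\theta) y < 0$ for all nonzero $y$, it follows that all its eigenvalues are nonzero. Therefore, the eigenvalues of $A(\theta)' A(\theta)$ are all positive and real since $A(\theta)' A(\theta)$ is a symmetric matrix. Since by Assumption 3.2 this holds for all $\theta \in \mathbb{R}^K$, there is a global minimum, $b_A^2$, and a global maximum, $B_A^2$, such that

$$B_A^2 \ge \lambda_{\max}\left(A(\theta)' A(\theta)\right) \ge \lambda_{\min}\left(A(\theta)' A(\theta)\right) \ge b_A^2, \qquad \forall \theta \in \mathbb{R}^K,$$

where we denote by $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ the minimal and maximal eigenvalues of the matrix respectively. Using Horn and Johnson (1985) section 5.6.6, we have $\lambda_{\max}\left(A(\theta)' A(\theta)\right) = \|A(\theta)\|_2^2$, thus, we get an upper bound on the matrix norm. Let us look at the norm of $A(\theta)^{-1}$:

$$\begin{aligned}
\left\|A(\theta)^{-1}\right\|_2^2 &= \lambda_{\max}\left(\left(A(\theta)'\right)^{-1} A(\theta)^{-1}\right) \\
&= \lambda_{\max}\left(\left(A(\theta) A(\theta)'\right)^{-1}\right) \\
&= 1/\lambda_{\min}\left(A(\theta) A(\theta)'\right) \\
&= 1/\lambda_{\min}\left(\left(A(\theta)' A(\theta)\right)'\right) \\
&= 1/\lambda_{\min}\left(A(\theta)' A(\theta)\right),
\end{aligned}$$

thus $\left\|A(\theta)^{-1}\right\|_2 = \sqrt{1/\lambda_{\min}\left(A(\theta)' A(\theta)\right)}$, which is at most $b_A^{-1}$.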
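5. Let us look at the $ij$ entry of the matrix $\frac{d}{dt} A(\theta)$, where using similar arguments to item 3 we get

$$\begin{aligned}
\left|\left[\frac{d}{dt} A(\theta)\right]_{ij}\right| &\le \left|\left[\Phi' \frac{d}{dt}\left(\Pi(\theta)\right) \left((1-\lambda) \sum_{m=0}^{\infty} \lambda^m P(\theta)^{m+1} - I\right) \Phi\right]_{ij}\right| \\
&\quad + \left|\left[\Phi' \Pi(\theta) \frac{d}{dt}\left((1-\lambda) \sum_{m=0}^{\infty} \lambda^m P(\theta)^{m+1} - I\right) \Phi\right]_{ij}\right|,
\end{aligned}$$

and both terms are bounded by constants that depend only on $B_\phi$, $B_{\pi 1}$, $B_{P1}$, $B_{\theta 1}$, and $1/(1-\lambda)$. Since the matrix entries are uniformly bounded in $\theta$, so is the matrix $\dot{A}(\theta)' \dot{A}(\theta)$, and so is the largest eigenvalue of $\dot{A}(\theta)' \dot{A}(\theta)$, which implies the uniform boundedness of $\|\dot{A}(\theta)\|_2$.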

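6. For a general invertible square matrix, $X(t)$, we have

$$0 = \frac{d}{dt}(I) = \frac{d}{dt}\left(X(t)^{-1} X(t)\right) = \frac{d}{dt}\left(X(t)^{-1}\right) X(t) + X(t)^{-1} \frac{d}{dt}\left(X(t)\right).$$

Rearranging it we get

$$\frac{d}{dt}\left(X(t)^{-1}\right) = -X(t)^{-1} \frac{d}{dt}\left(X(t)\right) X(t)^{-1}.$$

Using this identity yields

$$\left\|\frac{d}{dt}\left(A(\theta)^{-1}\right)\right\|_2 = \left\|-A(\theta)^{-1} \frac{d}{dt}\left(A(\theta)\right) A(\theta)^{-1}\right\|_2 \le \left\|A(\theta)^{-1}\right\|_2^2 B_{A1} \le b_A^{-2} B_{A1}.$$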



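7. Examining the norm of $\frac{d}{dt} w^*$ yields

$$\begin{aligned}
\left\|\frac{d}{dt} w^*\right\|_2 &= \left\|\frac{d}{dt}\left(A(\theta)^{-1} b(\theta)\right)\right\|_2 \\
&= \left\|\frac{d}{dt}\left(A(\theta)^{-1}\right) b(\theta) + A(\theta)^{-1} \frac{d}{dt} b(\theta)\right\|_2 \\
&\le b_A^{-2} B_{A1} B_b + b_A^{-1} B_{b1} \\
&\triangleq B_{w1}.
\end{aligned}$$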
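Two facts used in items 4 and 6 of the proof above are easy to confirm numerically: the spectral norm identities $\|X\|_2^2 = \lambda_{\max}(X'X)$ and $\|X^{-1}\|_2^2 = 1/\lambda_{\min}(X'X)$, and the derivative identity $\frac{d}{dt}X(t)^{-1} = -X(t)^{-1}\dot{X}(t)X(t)^{-1}$. The sketch below uses a hypothetical smooth matrix path `X(t)` (not the matrix $A(\theta)$ of the paper) and compares the analytic derivative with a central finite difference.

```python
import numpy as np

def X(t):
    # A smooth, invertible matrix path (hypothetical example).
    return np.array([[3.0 + np.sin(t), 0.5 * t],
                     [np.cos(t), 2.0 + t * t]])

def dX(t):
    # Entrywise time derivative of X(t).
    return np.array([[np.cos(t), 0.5],
                     [-np.sin(t), 2.0 * t]])

t, h = 0.7, 1e-5
X_inv = np.linalg.inv(X(t))

# Identity from item 6: d/dt X^{-1} = -X^{-1} (dX/dt) X^{-1}.
analytic = -X_inv @ dX(t) @ X_inv
numeric = (np.linalg.inv(X(t + h)) - np.linalg.inv(X(t - h))) / (2.0 * h)

# Identities from item 4: spectral norms via the eigenvalues of X'X.
eigs = np.linalg.eigvalsh(X(t).T @ X(t))
norm_X = np.linalg.norm(X(t), 2)
norm_X_inv = np.linalg.norm(X_inv, 2)

# Resulting bound on the derivative of the inverse, as used in item 6.
derivative_bound = norm_X_inv ** 2 * np.linalg.norm(dX(t), 2)
```

The last line is the sub-multiplicative bound that turns a bound on $\|\dot{A}(\theta)\|_2$ and on $\|A(\theta)^{-1}\|_2$ into a bound on the derivative of the inverse.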



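We wish to use Lemma D.2 for (27), thus, we show that the assumptions of Lemma D.2 are valid.

Lemma D.7

1. We have

$$\limsup_{t \to \infty} \|w^*(\theta(t)) - w(t)\|_2 \le \frac{1}{\Gamma_w} B_{\Delta w}, \qquad (54)$$

where

$$B_{\Delta w} \triangleq \frac{B_{w1} + B_G B_{\Delta\eta}}{\gamma}.$$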
2. We have

limsup\\h{e{t)) - hiw{t))\\^ < 



where 

BAh^\X\L{BAraf ■ 



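Proof

1. Without loss of generality, we can eliminate the projection operator since we can choose $B_w$ to be large enough such that $w^*(\theta)$ will be inside the bounded space. We take $M(t) = A(\theta)$, $F_1(t) = w^*(\theta(t))$, and $F_2(t) = G(\theta)(\eta(\theta) - \tilde{\eta})$. By previous lemmas we can see that the assumptions of Lemma D.2 hold. By Lemma D.6 (7), $\|\dot{w}^*(\theta)\|_2$ is bounded by $B_{w1}$, by Lemma D.5 we have a bound on $|\eta(\theta) - \tilde{\eta}|$, and by Lemma D.3 we have a bound on $w' A(\theta) w$. Using these bounds and applying Lemma D.2 provides the desired result.

2. Suppressing the time dependence for simplicity and expressing $\|h(\theta) - \tilde{h}(w)\|_\infty$ using $\epsilon_{app}$ and the previous result yields

$$\begin{aligned}
\|h(\theta) - \tilde{h}(w)\|_\infty &\le \|h(\theta) - \tilde{h}(w)\|_2 \\
&= \|h(\theta) - \tilde{h}(w^*) + \tilde{h}(w^*) - \tilde{h}(w)\|_2 \\
&\le \|h(\theta) - \tilde{h}(w^*)\|_2 + \|\tilde{h}(w^*) - \tilde{h}(w)\|_2. \qquad (55)
\end{aligned}$$

For the first term on the r.h.s. of the final equation in (55) we have

$$\begin{aligned}
\|h(\theta) - \tilde{h}(w^*)\|_2 &= \left\|\Pi(\theta)^{-1/2}\,\Pi(\theta)^{1/2} \left(h(\theta) - \tilde{h}(w^*)\right)\right\|_2 \\
&\le \left\|\Pi(\theta)^{-1/2}\right\|_2 \left\|h(\theta) - \tilde{h}(w^*)\right\|_{\Pi(\theta)}, \qquad (56)
\end{aligned}$$

where we use the sub-multiplicativity of the matrix norms in the inequality, and the two factors are then bounded using Lemma 3.6 and (17), respectively. For the second term on the r.h.s. of the final equation in (55) we have

$$\begin{aligned}
\|\tilde{h}(w^*) - \tilde{h}(w)\|_2^2 &= \|\Phi\left(w^*(\theta) - w\right)\|_2^2 \\
&= \sum_{k=1}^{|X|} \left(\sum_{l=1}^{L} \phi_l(k) \left(w_l^*(\theta) - w_l\right)\right)^2 \\
&\le \sum_{k=1}^{|X|} \left(\sum_{l=1}^{L} \phi_l(k)^2\right) \left(\sum_{l=1}^{L} \left(w_l^*(\theta) - w_l\right)^2\right) \\
&\le |X| L \|w^*(\theta) - w\|_2^2, \qquad (57)
\end{aligned}$$

which by (54) is asymptotically at most $|X| L \left(B_{\Delta w}/\Gamma_w\right)^2$. Combining (54)-(57) yields the desired result.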
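The chain of norm inequalities in item 2 can be sanity-checked on random data. The sketch below rests on assumptions that are not spelled out in the text: the basis functions are taken to satisfy $|\phi_l(k)| \le 1$, and `Phi` and `dw` (standing in for $w^*(\theta) - w$) are synthetic. It confirms $\|\Phi\,\Delta w\|_\infty \le \|\Phi\,\Delta w\|_2 \le \sqrt{|X| L}\,\|\Delta w\|_2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_features = 8, 3
Phi = rng.uniform(-1.0, 1.0, size=(n_states, n_features))  # assumes |phi_l(k)| <= 1
dw = rng.standard_normal(n_features)                        # stands in for w* - w

dh = Phi @ dw                                 # difference of approximate values
sup_norm = np.abs(dh).max()                   # ||.||_inf
two_norm = np.linalg.norm(dh)                 # ||.||_2
cs_bound = np.sqrt(n_states * n_features) * np.linalg.norm(dw)  # Cauchy-Schwarz
```

The bound is loose by design: it only needs the critic's parameter error to control the value error uniformly over states.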
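Using Lemma D.7 we can provide a bound on the second term of (27).

Lemma D.8 We have

$$\limsup_{t \to \infty} \left\|\sum_{x,y \in X \times X,\, u \in U} D^{(x,u,y)}(\theta)\left(d(x, y, \theta) - \tilde{d}(x, y, w)\right)\right\|_2 \le \frac{B_{\Delta td1}}{\Gamma_w} + \frac{B_{\Delta td2}}{\Gamma_\eta} + B_{\Delta td3}\,\epsilon_{app},$$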



where 



Ba 



tdl 



1 1 2B^ 

— ■ 2BqiBAhl, BAtd2 = ' BatiB^,, i?Atd3 = 



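Proof Simplifying the notation by suppressing the time dependence, we bound the TD signal in the limit, i.e.,

$$\begin{aligned}
\limsup_{t \to \infty} \left|d(x, y, \theta) - \tilde{d}(x, y, w)\right| &= \limsup_{t \to \infty} \left|\left(r(x) - \eta(\theta) + h(y, \theta) - h(x, \theta)\right) - \left(r(x) - \tilde{\eta} + \tilde{h}(y, w) - \tilde{h}(x, w)\right)\right| \\
&\le \limsup_{t \to \infty} |\eta(\theta) - \tilde{\eta}| + \limsup_{t \to \infty} 2\left\|h(\theta) - \tilde{h}(w)\right\|_\infty,
\end{aligned}$$

where the first term is bounded by Lemma D.5 and the second term by Lemma D.7. With some more algebra we have

$$\begin{aligned}
\limsup_{t \to \infty} &\left\|\sum_{x,y \in X \times X,\, u \in U} D^{(x,u,y)}(\theta)\left(d(x, y, \theta) - \tilde{d}(x, y, w)\right)\right\| \\
&\le \limsup_{t \to \infty} \sum_{x,y \in X \times X,\, u \in U} \pi(x)\,\mu(u \mid x, \theta)\,P(y \mid x, u)\,\|\psi(x, u, \theta)\| \cdot \left|d(x, y, \theta) - \tilde{d}(x, y, w)\right| \\
&\le \frac{B_{\Delta td1}}{\Gamma_w} + \frac{B_{\Delta td2}}{\Gamma_\eta} + B_{\Delta td3}\,\epsilon_{app}.
\end{aligned}$$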



We see that the first two terms in this bound are adjustable by choosing appropriate $\Gamma_\eta$ and $\Gamma_w$. We conclude with the proof of Theorem 4.7.
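To make the trade-off concrete, the following sketch evaluates a bound of the form $B_{\Delta td1}/\Gamma_w + B_{\Delta td2}/\Gamma_\eta + B_{\Delta td3}\,\epsilon_{app}$ for growing gains. All numeric values are hypothetical placeholders (the constants are set to 1 and $\epsilon_{app}$ to 0.05 purely for illustration), and `gradient_bound` is an illustrative name, not a quantity computed by the algorithm.

```python
def gradient_bound(gain_w, gain_eta, eps_app, b_td1=1.0, b_td2=1.0, b_td3=1.0):
    """Dynamic error (two gain-dependent terms) plus approximation error."""
    return b_td1 / gain_w + b_td2 / gain_eta + b_td3 * eps_app

eps_app = 0.05                       # hypothetical approximation error
gains = (1.0, 10.0, 100.0, 1000.0)   # Gamma_w = Gamma_eta in this illustration
bounds = [gradient_bound(g, g, eps_app) for g in gains]

# With arbitrarily large gains only the approximation term remains.
floor = gradient_bound(float("inf"), float("inf"), eps_app)
```

The sequence `bounds` decreases toward `floor`, which reflects the statement that the dynamic error can be tuned away while the approximation error depends only on the critic's basis.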



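Proof of Theorem 4.7

We define

$$B_{\nabla\eta} \triangleq \frac{B_{\Delta td1}}{\Gamma_w} + \frac{B_{\Delta td2}}{\Gamma_\eta} + B_{\Delta td3}\,\epsilon_{app}.$$

For an arbitrary $\delta > 0$, define the set

$$B_\delta \triangleq \{\theta : \|\nabla \eta(\theta)\| \le B_{\nabla\eta} + \delta\}. \qquad (58)$$

We claim that the trajectory $\theta(t)$ visits $B_\delta$ infinitely often. Assume the contrary, that

$$\liminf_{t \to \infty} \|\nabla \eta(\theta)\|_2 > B_{\nabla\eta} + \delta. \qquad (59)$$

Thus, on the set $B_\delta^c$ for $t$ large enough we have

$$\begin{aligned}
\dot{\eta}(\theta) &= \nabla \eta(\theta) \cdot \dot{\theta} \\
&= \nabla \eta(\theta) \cdot \left(\nabla \eta(\theta) + \sum_{x,y \in X \times X} D^{(x,u,y)}(\theta)\left(d(x, y) - \tilde{d}(x, y)\right)\right) \\
&= \|\nabla \eta(\theta)\|_2^2 + \nabla \eta(\theta) \cdot \left(\sum_{x,y \in X \times X} D^{(x,u,y)}(\theta)\left(d(x, y) - \tilde{d}(x, y)\right)\right) \qquad (60) \\
&\ge \|\nabla \eta(\theta)\|_2^2 - \|\nabla \eta(\theta)\|_2 B_{\nabla\eta} \\
&= \|\nabla \eta(\theta)\|_2 \left(\|\nabla \eta(\theta)\|_2 - B_{\nabla\eta}\right) \\
&> \|\nabla \eta(\theta)\|_2 \left(B_{\nabla\eta} + \delta - B_{\nabla\eta}\right) \\
&> \left(B_{\nabla\eta} + \delta\right)\delta.
\end{aligned}$$

By (59), there exists a time $t_0$ such that for all $t > t_0$ we have $\theta(t) \in B_\delta^c$. Therefore,

$$\eta(\infty) = \eta(t_0) + \int_{t_0}^{\infty} \dot{\eta}(\theta)\,dt > \eta(t_0) + \int_{t_0}^{\infty} \left(B_{\nabla\eta} + \delta\right)\delta\,dt = \infty, \qquad (61)$$

which contradicts the boundedness of $\eta(\theta)$. Since the claim holds for all $\delta > 0$, the result follows.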